Foreword


Foreword from the First Edition 


When I first got a summer job at MIT's Project MAC almost 30 
years ago, I was delighted to be able to work with the DEC 
PDP-10 computer, which was more fun to program in assembly 
language than any other computer, bar none, because of its rich 
yet tractable set of instructions for performing bit tests, bit 
masking, field manipulation, and operations on integers. Though 
the PDP-10 has not been manufactured for quite some years, 
there remains a thriving cult of enthusiasts who keep old PDP-10 
hardware running and who run old PDP-10 software—entire 
operating systems and their applications—by using personal 
computers to simulate the PDP-10 instruction set. They even 
write new software; there is now at least one Web site with 
pages that are served up by a simulated PDP-10. (Come on, stop 
laughing—it's no sillier than keeping antique cars running.) 


I also enjoyed, in that summer of 1972, reading a brand-new 
MIT research memo called HAKMEM, a bizarre and eclectic 
potpourri of technical trivia.1 The subject matter ranged from 
electrical circuits to number theory, but what intrigued me most 
was its small catalog of ingenious little programming tricks. 
Each such gem would typically describe some plausible yet 
unusual operation on integers or bit strings (such as counting the 
1-bits in a word) that could easily be programmed using either a 
longish fixed sequence of machine instructions or a loop, and 
then show how the same thing might be done much more 
cleverly, using just four or three or two carefully chosen 
instructions whose interactions are not at all obvious until 
explained or fathomed. For me, devouring these little 
programming nuggets was like eating peanuts, or rather bonbons 
(I just couldn't stop), and there was a certain richness to them,
a certain intellectual depth, elegance, even poetry. 


“Surely,” I thought, “there must be more of these,” and
indeed over the years I collected, and in some cases discovered, 
a few more. “There ought to be a book of them.” 


I was genuinely thrilled when I saw Hank Warren’s 
manuscript. He has systematically collected these little
programming tricks, organized them thematically, and explained
them clearly. While some of them may be described in terms of 
machine instructions, this is not a book only for assembly 
language programmers. The subject matter is basic structural 
relationships among integers and bit strings in a computer and 
efficient techniques for performing useful operations on them. 
These techniques are just as useful in the C or Java 
programming languages as they are in assembly language. 


Many books on algorithms and data structures teach 
complicated techniques for sorting and searching, for 
maintaining hash tables and binary trees, for dealing with 
records and pointers. They overlook what can be done with very 
tiny pieces of data—bits and arrays of bits. It is amazing what 
can be done with just binary addition and subtraction and 
maybe some bitwise operations; the fact that the carry chain 
allows a single bit to affect all the bits to its left makes addition 
a peculiarly powerful data manipulation operation in ways that 
are not widely appreciated. 


Yes, there ought to be a book about these techniques. Now it 
is in your hands, and it's terrific. If you write optimizing 
compilers or high-performance code, you must read this book. 
You otherwise might not use this bag of tricks every single day— 
but if you find yourself stuck in some situation where you 
apparently need to loop over the bits in a word, or to perform 
some operation on integers and it just seems harder to code than 
it ought, or you really need the inner loop of some integer or bit- 
fiddly computation to run twice as fast, then this is the place to 
look. Or maybe you'll just find yourself reading it straight 
through out of sheer pleasure. 


Guy L. Steele, Jr. 
Burlington, Massachusetts 
April 2002 


Preface 


Caveat Emptor: The cost of software 
maintenance increases with the square of 
the programmer's creativity. 


First Law of Programmer Creativity, 
Robert D. Bliss, 1992 


This is a collection of small programming tricks that I have come 
across over many years. Most of them will work only on 
computers that represent integers in two's-complement form. 
Although a 32-bit machine is assumed when the register length 
is relevant, most of the tricks are easily adapted to machines 
with other register sizes. 


This book does not deal with large tricks such as 
sophisticated sorting and compiler optimization techniques. 
Rather, it deals with small tricks that usually involve individual 
computer words or instructions, such as counting the number of 
1-bits in a word. Such tricks often use a mixture of arithmetic
and logical instructions. 


It is assumed throughout that integer overflow interrupts 
have been masked off, so they cannot occur. C, Fortran, and 
even Java programs run in this environment, but Pascal and Ada 
users beware! 


The presentation is informal. Proofs are given only when the 
algorithm is not obvious, and sometimes not even then. The 
methods use computer arithmetic, “floor” functions, mixtures of
arithmetic and logical operations, and so on. Proofs in this 
domain are often difficult and awkward to express. 


To reduce typographical errors and oversights, many of the 
algorithms have been executed. This is why they are given in a 
real programming language, even though, like every computer 
language, it has some ugly features. C is used for the high-level 
language because it is widely known, it allows the 
straightforward mixture of integer and bit-string operations, and 
C compilers that produce high-quality object code are available. 


Occasionally, machine language is used, employing a three- 
address format, mainly for ease of readability. The assembly
language used is that of a fictitious machine that is
representative of today's RISC computers. 


Branch-free code is favored, because on many computers, 
branches slow down instruction fetching and inhibit executing 
instructions in parallel. Another problem with branches is that 
they can inhibit compiler optimizations such as instruction 
scheduling, commoning, and register allocation. That is, the 
compiler may be more effective at these optimizations with a 
program that consists of a few large basic blocks rather than 
many small ones. 


The code sequences also tend to favor small immediate 
values, comparisons to zero (rather than to some other number), 
and instruction-level parallelism. Although much of the code 
would become more concise by using table lookups (from 
memory), this is not often mentioned. This is because loads are 
becoming more expensive relative to arithmetic instructions, and 
the table lookup methods are often not very interesting 
(although they are often practical). But there are exceptional 
cases. 


Finally, I should mention that the term "hacker" in the title is 
meant in the original sense of an aficionado of computers— 
someone who enjoys making computers do new things, or do old 
things in a new and clever way. The hacker is usually quite good 
at his craft, but may very well not be a professional computer 
programmer or designer. The hacker's work may be useful or 
may be just a game. As an example of the latter, more than one 
determined hacker has written a program which, when 
executed, writes out an exact copy of itself.1 This is the sense in 
which we use the term "hacker." If you're looking for tips on 
how to break into someone else's computer, you won't find them 
here. 
Chapter 1. Introduction


1-1 Notation 


This book distinguishes between mathematical expressions of 
ordinary arithmetic and those that describe the operation of a 
computer. In “computer arithmetic,” operands are bit strings, or 
bit vectors, of some definite fixed length. Expressions in 
computer arithmetic are similar to those of ordinary arithmetic, 
but the variables denote the contents of computer registers. The 
value of a computer arithmetic expression is simply a string of 
bits with no particular interpretation. An operator, however, 
interprets its operands in some particular way. For example, a 
comparison operator might interpret its operands as signed 
binary integers or as unsigned binary integers; our computer 
arithmetic notation uses distinct symbols to make the type of 
comparison clear. 


The main difference between computer arithmetic and 
ordinary arithmetic is that in computer arithmetic, the results of 
addition, subtraction, and multiplication are reduced modulo $2^n$,
where n is the word size of the machine. Another difference is 
that computer arithmetic includes a large number of operations. 
In addition to the four basic arithmetic operations, computer 
arithmetic includes logical and, exclusive or, compare, shift left, 
and so on. 


Unless specified otherwise, the word size is 32 bits, and 
signed integers are represented in two's-complement form. 


Expressions of computer arithmetic are written similarly to 
those of ordinary arithmetic, except that the variables that 
denote the contents of computer registers are in bold face type. 
This convention is commonly used in vector algebra. We regard 
a computer word as a vector of single bits. Constants also appear 
in bold-face type when they denote the contents of a computer 
register. (This has no analogy with vector algebra because in 
vector algebra the only way to write a constant is to display the 
vector's components.) When a constant denotes part of an
instruction, such as the immediate field of a shift instruction, 
light-face type is used. 


If an operator such as “+” has bold-face operands, then that
operator denotes the computer's addition operation (“vector
addition”). If the operands are light-faced, then the operator
denotes the ordinary scalar arithmetic operation. We use a
light-faced variable $x$ to denote the arithmetic value of a bold-faced
variable $\mathbf{x}$ under an interpretation (signed or unsigned) that
should be clear from the context. Thus, if $\mathbf{x}$ = 0x80000000 and
$\mathbf{y}$ = 0x80000000, then, under signed integer interpretation,
$x = y = -2^{31}$, $x + y = -2^{32}$, and $\mathbf{x} + \mathbf{y} = \mathbf{0}$. Here,
0x80000000 is hexadecimal notation for a bit string consisting
of a 1-bit followed by 31 0-bits.
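The distinction between the two kinds of addition can be sketched in C. This is only an illustration under stated assumptions: the function names are hypothetical, `uint32_t` stands in for a 32-bit register, and a 64-bit type stands in for "ordinary" arithmetic that does not wrap.

```c
#include <stdint.h>

/* Computer arithmetic: the register sum, reduced modulo 2^32.
   Unsigned addition in C wraps in exactly this way. */
static uint32_t add_register(uint32_t x, uint32_t y) {
    return x + y;
}

/* Ordinary arithmetic on the signed interpretations of the operands,
   carried out in 64 bits so that the true sum is representable. */
static int64_t add_ordinary(int32_t x, int32_t y) {
    return (int64_t)x + (int64_t)y;
}
```

With both operands equal to 0x80000000, `add_register` yields 0, while `add_ordinary` applied to the signed values (each equal to `INT32_MIN`) yields -4294967296, that is, $-2^{32}$.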


Bits are numbered from the right, with the rightmost (least 
significant) bit being bit 0. The terms “bits,” “nibbles,” “bytes,” 
“halfwords,” “words,” and “doublewords” refer to lengths of 1, 
4, 8, 16, 32, and 64 bits, respectively. 


Short and simple sections of code are written in computer 
algebra, using its assignment operator (left arrow) and 
occasionally an if statement. In this role, computer algebra is 
serving as little more than a machine-independent way of 
writing assembly language code. 


Programs too long or complex for computer algebra are 
written in the C programming language, as defined by the ISO 
1999 standard. 


A complete description of C would be out of place in this 
book, but Table 1-1 contains a brief summary of most of the 
elements of C [H&S] that are used herein. This is provided for 
the benefit of the reader who is familiar with some procedural 
programming language, but not with C. Table 1-1 also shows 
the operators of our computer-algebraic arithmetic language. 
Operators are listed from highest precedence (tightest binding) 
to lowest. In the Precedence column, L means left-associative; 
that is, 


a • b • c = (a • b) • c


and R means right-associative. Our computer-algebraic notation 
follows C in precedence and associativity. 


TABLE 1-1. EXPRESSIONS OF C AND COMPUTER ALGEBRA

Prec.  C                              Computer Algebra                Description

       0x...                          0x..., 0b...                    Hexadecimal, binary constants
       a[k]                                                           Selecting the kth component
                                      x0, x1, ...                     Different variables, or bit selection (clarified in text)
       f(x,...)                       f(x, ...)                       Function evaluation
                                      abs(x)                          Absolute value (but abs($-2^{31}$) = $-2^{31}$)
                                      nabs(x)                         Negative of the absolute value
       x++, x--                                                       Postincrement, decrement
       ++x, --x                                                       Preincrement, decrement
       (type name)x                                                   Type conversion
                                      xᵏ                              x to the kth power
       ~x                             ¬x                              Bitwise not (one's-complement)
       !x                                                             Logical not (if x = 0 then 1 else 0)
       -x                             -x                              Arithmetic negation
       x*y                            x * y                           Multiplication, modulo word size
       x/y                            x ÷ y                           Signed integer division
       x/y                            x ÷ᵘ y                          Unsigned integer division
       x%y                            rem(x, y)                       Remainder (may be negative), of (x ÷ y), signed arguments
       x%y                            remu(x, y)                      Remainder of x ÷ᵘ y, unsigned arguments
                                      mod(x, y)                       x reduced modulo y to the interval [0, abs(y) - 1]; signed arguments
12 L   x + y, x - y                   x + y, x - y                    Addition, subtraction
11 L   x << y, x >> y                 x ≪ y, x ≫ᵘ y                   Shift left, right with 0-fill (“logical” shifts)
11 L   x >> y                         x ≫ˢ y                          Shift right with sign-fill (“arithmetic” or “algebraic” shift)
11 L                                  x ≪ʳᵒᵗ y, x ≫ʳᵒᵗ y              Rotate shift left, right
10 L   x < y, x <= y, x > y, x >= y   x < y, x ≤ y, x > y, x ≥ y      Signed comparison
10 L   x < y, x <= y, x > y, x >= y   x <ᵘ y, x ≤ᵘ y, x >ᵘ y, x ≥ᵘ y  Unsigned comparison
 9 L   x == y, x != y                 x = y, x ≠ y                    Equality, inequality
 8 L   x & y                          x & y                           Bitwise and
 7 L   x ^ y                          x ⊕ y                           Bitwise exclusive or
 7 L                                  x ≡ y                           Bitwise equivalence (¬(x ⊕ y))
 6 L   x | y                          x | y                           Bitwise or
 5 L   x && y                         x & y                           Conditional and (if x = 0 then 0 else if y = 0 then 0 else 1)
 4 L   x || y                         x | y                           Conditional or (if x ≠ 0 then 1 else if y ≠ 0 then 1 else 0)
 3 L                                  x || y                          Concatenation
 2 R   x = y                          x ← y                           Assignment


In addition to the notations described in Table 1-1, those of 
Boolean algebra and of standard mathematics are used, with 
explanations where necessary. 


Our computer algebra uses other functions in addition to 
“abs,” “rem,” and so on. These are defined where introduced.




In C, the expression x < y < z means to evaluate x < y to a
0/1-valued result, and then compare that result to z. In
computer algebra, the expression x < y < z means (x < y) & (y
< z).
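The difference between the two readings can be made concrete with a small sketch. The function names are hypothetical; the first function writes out explicitly the grouping that C applies to `x < y < z`.

```c
/* What C computes for "x < y < z": the 0/1 result of x < y
   is itself compared with z. */
static int chained_as_c(int x, int y, int z) {
    return (x < y) < z;
}

/* What the computer-algebra notation means: both comparisons must hold. */
static int chained_as_algebra(int x, int y, int z) {
    return (x < y) & (y < z);
}
```

For x = 3, y = 2, z = 1 the C reading gives 1 (because 0 < 1), whereas the computer-algebra reading gives 0.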


C has three loop control statements: while, do, and for. The 
while statement is written: 


while (expression) statement 


First, expression is evaluated. If true (nonzero), statement is 
executed and control returns to evaluate expression again. If 
expression is false (0), the while-loop terminates. 


The do statement is similar, except the test is at the bottom
of the loop. It is written: 


do statement while (expression) 


First, statement is executed, and then expression is evaluated. If 
true, the process is repeated, and if false, the loop terminates. 


The for statement is written: 
for (e1; e2; e3) statement


First, e1, usually an assignment statement, is executed. Then e2,
usually a comparison, is evaluated. If false, the for-loop
terminates. If true, statement is executed. Finally, e3, usually an
assignment statement, is executed, and control returns to
evaluate e2 again. Thus, the familiar “do i = 1 to n” is written:


for (i = 1; i <= n; i++) 


(This is one of the few contexts in which we use the 
postincrement operator.) 


The ISO C standard does not specify whether right shifts
(“>>” operator) of signed quantities are 0-propagating or sign-propagating.
In the C code herein, it is assumed that if the left
operand is signed, then a sign-propagating shift results (and if it
is unsigned, then a 0-propagating shift results, following ISO).
Most modern C compilers work this way. 
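For readers who prefer not to rely on that assumption, the following is a minimal sketch of a sign-propagating shift built only from unsigned operations, whose behavior C does define. The function names are hypothetical, the shift amount is assumed to lie in 0 to 31, and the final conversion back to `int32_t` assumes the usual two's-complement conversion.

```c
#include <stdint.h>

/* 0-propagating ("logical") right shift; defined by C for unsigned
   operands when 0 <= n <= 31. */
static uint32_t shr_logical(uint32_t x, unsigned n) {
    return x >> n;
}

/* Sign-propagating ("arithmetic") right shift, 0 <= n <= 31, written
   without shifting a signed value. */
static int32_t shr_arithmetic(int32_t x, unsigned n) {
    uint32_t ux = (uint32_t)x;
    uint32_t sign = 0u - (ux >> 31);   /* all 1's if x < 0, else 0 */
    /* Complement a negative value, shift in 0's, complement back:
       the 0's shifted in become copies of the sign bit. */
    return (int32_t)(((ux ^ sign) >> n) ^ sign);
}
```

For example, `shr_arithmetic(-8, 1)` gives -4, while `shr_logical(0x80000000u, 31)` gives 1.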




It is assumed here that left shifts are “logical.” (Some 
machines, mostly older ones, provide an "arithmetic" left shift, 
in which the sign bit is retained.) 


Another potential problem with shifts is that the ISO C 
standard specifies that if the shift amount is negative or is 
greater than or equal to the width of the left operand, the result 
is undefined. But, nearly all 32-bit machines treat shift amounts 
modulo 32 or 64. The code herein relies on one of these 
behaviors; an explanation is given when the distinction is 
important. 
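Portable C code that wants the "modulo 64" behavior has to spell it out, since a C shift by 32 or more on a 32-bit operand is undefined. A minimal sketch, assuming a 32-bit register and a shift amount taken from the rightmost six bits of the second operand (the function names are hypothetical):

```c
#include <stdint.h>

/* Shift left with the amount treated modulo 64: amounts 32..63
   shift every bit out, leaving 0. */
static uint32_t shl_mod64(uint32_t ra, uint32_t rb) {
    unsigned n = rb & 63u;             /* rightmost six bits of rb */
    return (n < 32u) ? (ra << n) : 0u;
}

/* Shift right with 0-fill, amount treated modulo 64. */
static uint32_t shr_mod64(uint32_t ra, uint32_t rb) {
    unsigned n = rb & 63u;
    return (n < 32u) ? (ra >> n) : 0u;
}
```

Under this model a shift by 32 produces 0, whereas a shift by 64 leaves the operand unchanged. A machine that treats shift amounts modulo 32 would instead leave the operand unchanged for a shift by 32.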


1-2 Instruction Set and Execution Time Model 


To permit a rough comparison of algorithms, we imagine them 
being coded for a machine with an instruction set similar to that 
of today's general purpose RISC computers, such as the IBM 
RS/6000, the Oracle SPARC, and the ARM architecture. The 
machine is three-address and has a fairly large number of 
general purpose registers—that is, 16 or more. Unless otherwise 
specified, the registers are 32 bits long. General register 0
contains a permanent 0, and the others can be used uniformly 
for any purpose. 


In the interest of simplicity there are no “special purpose” 
registers, such as a condition register or a register to hold status 
bits, such as “overflow.” The machine has no floating-point
instructions. Floating-point is only a minor topic in this book,
being mostly confined to Chapter 17. 


We recognize two varieties of RISC: a “basic RISC,” having 
the instructions shown in Table 1-2, and a “full RISC,” having 
all the instructions of the basic RISC, plus those shown in Table 
1-3. 


TABLE 1-2. BASIC RISC INSTRUCTION SET

Opcode Mnemonic      Operands     Description

add, sub, mul,       RT,RA,RB     RT ← RA op RB, where op is add, subtract,
div, divu, rem,                   multiply, divide signed, divide unsigned,
remu                              remainder signed, or remainder unsigned.

addi, muli           RT,RA,I      RT ← RA op I, where op is add or multiply,
                                  and I is a 16-bit signed immediate value.

addis                RT,RA,I      RT ← RA + (I || 0x0000).

and, or, xor         RT,RA,RB     RT ← RA op RB, where op is bitwise and, or,
                                  or exclusive or.

andi, ori, xori      RT,RA,Iu     As above, except the last operand is a
                                  16-bit unsigned immediate value.

beq, bne, blt,       RT,target    Branch to target if RT = 0, or if RT ≠ 0,
ble, bgt, bge                     or if RT < 0, or if RT ≤ 0, or if RT > 0, or
                                  if RT ≥ 0 (signed integer interpretation
                                  of RT).

bt, bf               RT,target    Branch true/false; same as bne/beq resp.

cmpeq, cmpne,        RT,RA,RB     RT gets the result of comparing RA with
cmplt, cmple,                     RB; 0 if false and 1 if true. Mnemonics
cmpgt, cmpge,                     denote compare for equality, inequality,
cmpltu, cmpleu,                   less than, and so on, as for the branch
cmpgtu, cmpgeu                    instructions; and in addition, the suffix
                                  “u” denotes an unsigned comparison.

cmpieq, cmpine,      RT,RA,I      Like the cmpeq group, except the second
cmpilt, cmpile,                   comparand is a 16-bit signed immediate
cmpigt, cmpige                    value.

cmpiequ, cmpineu,    RT,RA,Iu     Like the cmpltu group, except the second
cmpiltu, cmpileu,                 comparand is a 16-bit unsigned immediate
cmpigtu, cmpigeu                  value.

ldbu, ldh, ldhu,     RT,d(RA)     Load an unsigned byte, signed halfword,
ldw                               unsigned halfword, or word into RT from
                                  memory at location RA + d, where d is
                                  a 16-bit signed immediate value.

mulhs, mulhu         RT,RA,RB     RT gets the high-order 32 bits of the
                                  product of RA and RB; signed and unsigned.

not                  RT,RA        RT ← bitwise one's-complement of RA.

shl, shr, shrs       RT,RA,RB     RT ← RA shifted left or right by the
                                  amount given in the rightmost six bits of
                                  RB; 0-fill except for shrs, which is
                                  sign-fill. (The shift amount is treated
                                  modulo 64.)

shli, shri, shrsi    RT,RA,Iu     RT ← RA shifted left or right by the
                                  amount given in the 5-bit immediate field.

stb, sth, stw        RS,d(RA)     Store a byte, halfword, or word, from RS
                                  into memory at location RA + d, where
                                  d is a 16-bit signed immediate value.


TABLE 1-3. ADDITIONAL INSTRUCTIONS FOR THE “FULL RISC”

Opcode Mnemonic      Operands     Description

abs, nabs            RT,RA        RT gets the absolute value, or the
                                  negative of the absolute value, of RA.

andc, eqv, nand,     RT,RA,RB     Bitwise and with complement (of RB),
nor, orc                          equivalence, negative and, negative or,
                                  and or with complement.

extr                 RT,RA,I,L    Extract bits I through I+L-1 of RA, and
                                  place them right-adjusted in RT, with
                                  0-fill.

extrs                RT,RA,I,L    Like extr, but sign-fill.

ins                  RT,RA,I,L    Insert bits 0 through L-1 of RA into bits
                                  I through I+L-1 of RT.

nlz                  RT,RA        RT gets the number of leading 0's in RA
                                  (0 to 32).

pop                  RT,RA        RT gets the number of 1-bits in RA (0 to
                                  32).

ldb                  RT,d(RA)     Load a signed byte into RT from memory
                                  at location RA + d, where d is a 16-bit
                                  signed immediate value.

moveq, movne,        RT,RA,RB     RT ← RB if RA = 0, or if RA ≠ 0, and so
movlt, movle,                     on, else RT is unchanged.
movgt, movge

shlr, shrr           RT,RA,RB     RT ← RA rotate-shifted left or right by
                                  the amount given in the rightmost five
                                  bits of RB.

shlri, shrri         RT,RA,Iu     RT ← RA rotate-shifted left or right by
                                  the amount given in the 5-bit immediate
                                  field.

trpeq, trpne,        RA,RB        Trap (interrupt) if RA = RB, or RA ≠ RB,
trplt, trple,                     and so on.
trpgt, trpge,
trpltu, trpleu,
trpgtu, trpgeu

trpieq, trpine,      RA,I         Like the trpeq group, except the second
trpilt, trpile,                   comparand is a 16-bit signed immediate
trpigt, trpige                    value.

trpiequ, trpineu,    RA,Iu        Like the trpltu group, except the second
trpiltu, trpileu,                 comparand is a 16-bit unsigned immediate
trpigtu, trpigeu                  value.
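To make the descriptions of a few full RISC instructions concrete, here is a hedged C sketch of reference semantics for extr, extrs, ins, pop, and nlz. These are straightforward (not fast) renderings of the table entries. The C function signatures and the helper `low_mask` are assumptions made for illustration, with the field assumed to satisfy L ≥ 1 and I + L ≤ 32, and the conversion to `int32_t` assumes two's-complement behavior.

```c
#include <stdint.h>

/* Mask of the L low-order bits, valid for 0 <= L <= 32 (avoids a shift by 32). */
static uint32_t low_mask(unsigned L) {
    return (L >= 32u) ? 0xFFFFFFFFu : ((1u << L) - 1u);
}

/* extr: bits I through I+L-1 of ra, right-adjusted, with 0-fill. */
static uint32_t extr(uint32_t ra, unsigned I, unsigned L) {
    return (ra >> I) & low_mask(L);
}

/* extrs: like extr, but the field's top bit is propagated to the left. */
static int32_t extrs(uint32_t ra, unsigned I, unsigned L) {
    uint32_t field = extr(ra, I, L);
    uint32_t sign = 1u << (L - 1u);       /* top bit of the field */
    return (int32_t)((field ^ sign) - sign);
}

/* ins: bits 0 through L-1 of ra replace bits I through I+L-1 of rt. */
static uint32_t ins(uint32_t rt, uint32_t ra, unsigned I, unsigned L) {
    uint32_t m = low_mask(L) << I;
    return (rt & ~m) | ((ra << I) & m);
}

/* pop: number of 1-bits, clearing the rightmost 1-bit each iteration. */
static unsigned pop(uint32_t x) {
    unsigned n = 0;
    while (x != 0) { x &= x - 1; n++; }
    return n;
}

/* nlz: number of leading 0's (32 for x = 0), by a simple loop. */
static unsigned nlz(uint32_t x) {
    unsigned n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}
```

For instance, `extr(0xABCD1234, 8, 8)` yields 0x12, and `ins(0xFFFFFFFF, 0, 8, 8)` yields 0xFFFF00FF.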


In Tables 1-2, 1-3, and 1-4, RA and RB appearing as source
operands really mean the contents of those registers.


A real machine would have branch and link (for subroutine 
calls), branch to the address contained in a register (for 
subroutine returns and “switches”), and possibly some
instructions for dealing with special purpose registers. It would, 
of course, have a number of privileged instructions and 
instructions for calling on supervisor services. It might also have 
floating-point instructions. 


Some other computational instructions that a RISC computer 
might have are identified in Table 1-3. These are discussed in 
later chapters. 


It is convenient to provide the machine's assembler with a 
few "extended mnemonics." These are like macros whose 
expansion is usually a single instruction. Some possibilities are 
shown in Table 1-4. 


TABLE 1-4. EXTENDED MNEMONICS

Extended Mnemonic    Expansion          Description

b target             beq R0,target      Unconditional branch.
li RT,I              See text           Load immediate, $-2^{31} \le I < 2^{32}$.
mov RT,RA            ori RT,RA,0        Move register RA to RT.
neg RT,RA            sub RT,R0,RA       Negate (two's-complement).
subi RT,RA,I         addi RT,RA,-I      Subtract immediate ($I \ne -2^{15}$).


The load immediate instruction expands into one or two 
instructions, as required by the immediate value I. For example,
if $0 \le I < 2^{16}$, an or immediate (ori) from R0 can be used. If
$-2^{15} \le I < 0$, an add immediate (addi) from R0 can be used. If
the rightmost 16 bits of I are 0, add immediate shifted (addis) can
be used. Otherwise, two instructions are required, such as addis 
followed by ori. (Alternatively, in the last case, a load from 
memory could be used, but for execution time and space 
estimates we assume that two elementary arithmetic instructions 
are used.) 
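The case analysis above can be sketched in C as a function that reports how many elementary instructions the load immediate expansion needs. This is an illustration only: the function names are hypothetical, and the second function merely models the addis-then-ori pair on a 32-bit value.

```c
#include <stdint.h>

/* Number of instructions in the load-immediate expansion for I. */
static int li_instruction_count(int32_t I) {
    if (I >= 0 && I < 65536) return 1;           /* ori  from R0 */
    if (I >= -32768 && I < 0) return 1;          /* addi from R0 */
    if (((uint32_t)I & 0xFFFFu) == 0) return 1;  /* addis from R0 */
    return 2;                                    /* addis, then ori */
}

/* Value built by the two-instruction form: addis supplies the upper
   halfword (I || 0x0000), and ori fills in the lower halfword. */
static uint32_t li_two_step(uint32_t I) {
    uint32_t rt = (I >> 16) << 16;   /* addis RT,R0,upper halfword */
    return rt | (I & 0xFFFFu);       /* ori   RT,RT,lower halfword */
}
```

For example, 0x00010000 needs one instruction (addis), while 0x00012345 needs two.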


Of course, which instructions belong in the basic RISC and 
which belong in the full RISC is very much a matter of 
judgment. Quite possibly, divide unsigned and the remainder 
instructions should be moved to the full RISC category. 
Conversely, possibly load byte signed should be in the basic RISC 
category. It is in the full RISC set because it is probably of rather 
low frequency of use, and because in some technologies it is 
difficult to propagate a sign bit through so many positions and 
still make cycle time. 


The distinction between basic and full RISC involves many 
other such questionable judgments, but we won't dwell on them. 


The instructions are limited to two source registers and one 
target, which simplifies the computer (e.g., the register file 
requires no more than two read ports and one write port). It also 
simplifies an optimizing compiler, because the compiler does not 
need to deal with instructions that have multiple targets. The 
price paid for this is that a program that wants both the quotient 
and remainder of two numbers (not uncommon) must execute 
two instructions (divide and remainder). The usual machine
division algorithm produces the remainder as a by-product, so 
many machines make them both available as a result of one 
execution of divide. Similar remarks apply to obtaining the 
doubleword product of two words. 


The conditional move instructions (e.g., moveq) ostensibly
have only two source operands, but in a sense they have three. 
Because the result of the instruction depends on the values in 
RT, RA, and RB, a machine that executes instructions out of 
order must treat RT in these instructions as both a use and a set. 
That is, an instruction that sets RT, followed by a conditional 
move that sets RT, must be executed in that order, and the result
of the first instruction cannot be discarded. Thus, the designer of
such a machine may elect to omit the conditional move 
instructions to avoid having to consider an instruction with 
(logically) three source operands. On the other hand, the 
conditional move instructions do save branches. 
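The "three source operands" point shows up clearly if moveq is modeled as a C function: the old value of RT has to be passed in, because it is the result whenever the condition fails. The following branch-free sketch is an assumption-laden illustration (the function name and the mask technique are not taken from the instruction's definition):

```c
#include <stdint.h>

/* Model of moveq RT,RA,RB: the result is rb if ra == 0, else the old rt.
   All three of rt, ra, and rb are inputs. */
static uint32_t moveq(uint32_t rt, uint32_t ra, uint32_t rb) {
    uint32_t mask = 0u - (uint32_t)(ra == 0);  /* all 1's if ra == 0, else 0 */
    return (rb & mask) | (rt & ~mask);
}
```

Thus `moveq(5, 0, 9)` gives 9, and `moveq(5, 1, 9)` gives 5.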


Instruction formats are not relevant to the purposes of this 
book, but the full RISC instruction set described above, with 
floating-point and a few supervisory instructions added, can be 
implemented with 32-bit instructions on a machine with 32 
general purpose registers (5-bit register fields). By reducing the 
immediate fields of compare, load, store, and trap instructions to 
14 bits, the same holds for a machine with 64 general purpose 
registers (6-bit register fields). 


Execution Time 


We assume that all instructions execute in one cycle, except for 
the multiply, divide, and remainder instructions, for which we do 
not assume any particular execution time. Branches take one 
cycle whether they branch or fall through. 


The load immediate instruction is counted as one or two 
cycles, depending on whether one or two elementary arithmetic 
instructions are required to generate the constant in a register. 


Although load and store instructions are not often used in this 
book, we assume they take one cycle and ignore any load delay 
(time lapse between when a load instruction completes in the 
arithmetic unit and when the requested data is available for a 
subsequent instruction). 


However, knowing the number of cycles used by all the 
arithmetic and logical instructions is often insufficient for 
estimating the execution time of a program. Execution can be 
slowed substantially by load delays and by delays in fetching 
instructions. These delays, although very important and 
increasing in importance, are not discussed in this book. Another 
factor, one that improves execution time, is what is called 
“instruction-level parallelism,” which is found in many 
contemporary RISC chips, particularly those for “high-end” 
machines. 


These machines have multiple execution units and sufficient 
instruction-dispatching capability to execute instructions in 
parallel when they are independent (that is, when neither uses a
result of the other, and they don't both set the same register or
status bit). Because this capability is now quite common, the 
presence of independent operations is often pointed out in this 
book. Thus, we might say that such and such a formula can be 
coded in such a way that it requires eight instructions and 
executes in five cycles on a machine with unlimited instruction- 
level parallelism. This means that if the instructions are 
arranged in the proper order (“scheduled”), a machine with a 
sufficient number of adders, shifters, logical units, and registers 
can, in principle, execute the code in five cycles. 


We do not make too much of this, because machines differ 
greatly in their instruction-level parallelism capabilities. For 
example, an IBM RS/6000 processor from ca. 1992 has a three- 
input adder and can execute two consecutive add-type 
instructions in parallel even when one feeds the other (e.g., an 
add feeding a compare, or the base register of a load). As a 
contrary example, consider a simple computer, possibly for low- 
cost embedded applications, that has only one read port on its 
register file. Normally, this machine would take an extra cycle to 
do a second read of the register file for an instruction that has 
two register input operands. However, suppose it has a bypass so 
that if an instruction feeds an operand of the immediately 
following instruction, then that operand is available without 
reading the register file. On such a machine, it is actually 
advantageous if each instruction feeds the next—that is, if the 
code has no parallelism. 


Exercises 


1. Express the loop 
for (e1; e2; e3) statement


in terms of a while loop.
Can it be expressed as a do loop?


2. Code a loop in C in which the unsigned integer control 
variable i takes on all values from 0 to and including the 
maximum unsigned number, 0xFFFFFFFF (on a 32-bit
machine). 


3. For the more experienced reader: The instructions of the 
basic and full RISCs defined in this book can be executed
with at most two register reads and one write. What are
some common or plausible RISC instructions that either 
need more source operands or need to do more than one 
register write? 


Chapter 2. Basics 


2-1 Manipulating Rightmost Bits 
Some of the formulas in this section find application in later 
chapters. 
Use the following formula to turn off the rightmost 1-bit in a 
word, producing 0 if none (e.g., 01011000 ⇒ 01010000):
x & (x - 1) 
This can be used to determine if an unsigned integer is a power 
of 2 or is 0: apply the formula followed by a 0-test on the result. 
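As a sketch of how this reads in C (the function names are hypothetical, and `uint32_t` stands in for the 32-bit word):

```c
#include <stdint.h>

/* Turn off the rightmost 1-bit of x; the result is 0 if x has no 1-bits. */
static uint32_t turn_off_rightmost_one(uint32_t x) {
    return x & (x - 1);
}

/* Nonzero if x is 0 or a power of 2, i.e., if x has at most one 1-bit. */
static int is_pow2_or_zero(uint32_t x) {
    return turn_off_rightmost_one(x) == 0;
}
```

Applied to the example above, `turn_off_rightmost_one(0x58)` gives 0x50. To exclude 0 from the power-of-2 test, an additional check that x is nonzero would be needed.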
Use the following formula to turn on the rightmost 0-bit in a 
word, producing all 1’s if none (e.g., 10100111 ⇒ 10101111):
x | (x + 1) 


Use the following formula to turn off the trailing 1’s in a 
word, producing x if none (e.g., 10100111 ⇒ 10100000):


x & (x + 1) 


This can be used to determine if an unsigned integer is of the 
form $2^n - 1$, 0, or all 1’s: apply the formula followed by a 0-test
on the result. 
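A minimal C sketch of this test follows, with hypothetical function names and a 32-bit word assumed:

```c
#include <stdint.h>

/* Turn off the trailing 1's of x; the result is x if there are none. */
static uint32_t turn_off_trailing_ones(uint32_t x) {
    return x & (x + 1);
}

/* Nonzero if x consists only of trailing 1's: 0, 1, 3, 7, ..., all 1's. */
static int is_low_ones_mask(uint32_t x) {
    return turn_off_trailing_ones(x) == 0;
}
```

The all-1's case works because x + 1 wraps around to 0 in unsigned arithmetic.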


Use the following formula to turn on the trailing 0’s in a
word, producing x if none (e.g., 10101000 ⇒ 10101111):
x | (x - 1)


Use the following formula to create a word with a single 1-bit 
at the position of the rightmost 0-bit in x, producing 0 if none 
(e.g., 10100111 ⇒ 00001000):


¬x & (x + 1)


Use the following formula to create a word with a single 0-bit 
at the position of the rightmost 1-bit in x, producing all 1’s if 
none (e.g., 10101000 ⇒ 11110111):


¬x | (x - 1)


Use one of the following formulas to create a word with 1’s at 
the positions of the trailing 0’s in x, and 0’s elsewhere,
producing 0 if none (e.g., 01011000 ⇒ 00000111):


¬x & (x - 1), or
¬(x | -x), or
(x & -x) - 1
The first formula has some instruction-level parallelism. 
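The three alternatives for the trailing-0's mask can be written in C as follows. This is a sketch with hypothetical function names; negation is written as `0u - x` so that it is plainly unsigned arithmetic. In the first version, the complement and the decrement do not depend on each other, which is where the parallelism comes from.

```c
#include <stdint.h>

/* 1's at the positions of the trailing 0's of x, 0's elsewhere. */
static uint32_t trailing_zeros_mask_a(uint32_t x) {
    return ~x & (x - 1);          /* ~x and x - 1 are independent */
}

static uint32_t trailing_zeros_mask_b(uint32_t x) {
    return ~(x | (0u - x));
}

static uint32_t trailing_zeros_mask_c(uint32_t x) {
    return (x & (0u - x)) - 1;
}
```

All three agree for every x. For x = 0, where all 32 bits count as trailing 0's, each returns all 1's.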
Use the following formula to create a word with 0’s at the
positions of the trailing 1’s in x, and 1’s elsewhere, producing all
1’s if none (e.g., 10100111 ⇒ 11111000):


¬x | (x + 1)


Use the following formula to isolate the rightmost 1-bit, 
producing 0 if none (e.g., 01011000 ⇒ 00001000):


x & (-x)


Use the following formula to create a word with 1’s at the 
positions of the rightmost 1-bit and the trailing 0’s in x,
producing all 1’s if no 1-bit, and the integer 1 if no trailing 0’s
(e.g., 01011000 ⇒ 00001111):


x ⊕ (x - 1)


Use the following formula to create a word with 1’s at the 
positions of the rightmost 0-bit and the trailing 1’s in x,
producing all 1’s if no 0-bit, and the integer 1 if no trailing 1’s
(e.g., 01010111 ⇒ 00001111):


x ⊕ (x + 1)
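The isolation formula and the two exclusive-or formulas translate directly to C. In this sketch the function names are hypothetical, and the negation is again written as `0u - x`:

```c
#include <stdint.h>

/* Isolate the rightmost 1-bit of x; 0 if x has no 1-bits. */
static uint32_t isolate_rightmost_one(uint32_t x) {
    return x & (0u - x);
}

/* 1's at the rightmost 1-bit and at the trailing 0's below it. */
static uint32_t mask_through_rightmost_one(uint32_t x) {
    return x ^ (x - 1);
}

/* 1's at the rightmost 0-bit and at the trailing 1's below it. */
static uint32_t mask_through_rightmost_zero(uint32_t x) {
    return x ^ (x + 1);
}
```

The examples in the text check out: 0x58 isolates to 0x08, `mask_through_rightmost_one(0x58)` is 0x0F, and `mask_through_rightmost_zero(0x57)` is 0x0F.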
Use either of the following formulas to turn off the rightmost 
contiguous string of 1’s (e.g., 01011100 ⇒ 01000000)
[Wood]:


((x | (x - 1)) + 1) & x, or
((x & -x) + x) & x


These can be used to determine if a nonnegative integer is of the
form $2^j - 2^k$ for some $j \ge k \ge 0$: apply the formula followed by
a 0-test on the result.
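Both formulas, and the test built on them, can be sketched in C as follows (hypothetical function names, 32-bit unsigned words assumed):

```c
#include <stdint.h>

/* Turn off the rightmost contiguous string of 1's: first formula. */
static uint32_t clear_rightmost_run_a(uint32_t x) {
    return ((x | (x - 1)) + 1) & x;
}

/* Second formula: add the isolated rightmost 1-bit, so that the carry
   ripples through the run, then and with x. */
static uint32_t clear_rightmost_run_b(uint32_t x) {
    return ((x & (0u - x)) + x) & x;
}

/* Nonzero if x is a single contiguous run of 1's (or 0),
   i.e., of the form 2^j - 2^k with j >= k >= 0. */
static int is_pow2_difference(uint32_t x) {
    return clear_rightmost_run_a(x) == 0;
}
```

For example, 0x5C becomes 0x40 under either formula, and 14 = 16 - 2 passes the test.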


De Morgan's Laws Extended 


The logical identities known as De Morgan's laws can be thought 
of as distributing, or “multiplying in,” the not sign. This idea can 
be extended to apply to the expressions of this section, and a few 
more, as shown here. (The first two are De Morgan's laws.) 


¬(x & y) = ¬x | ¬y
¬(x | y) = ¬x & ¬y
¬(x + 1) = ¬x − 1
¬(x − 1) = ¬x + 1
     ¬−x = x − 1
¬(x ⊕ y) = ¬x ⊕ y = x ≡ y
¬(x ≡ y) = ¬x ≡ y = x ⊕ y
¬(x + y) = ¬x − y
¬(x − y) = ¬x + y


As an example of the application of these formulas,
¬(x | −(x + 1)) = ¬x & ¬−(x + 1) = ¬x & ((x + 1) − 1) = ¬x & x = 0.


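
The extended De Morgan identities can be spot-checked mechanically. The
sketch below is an illustration only (the function name is hypothetical);
it assumes 32-bit unsigned words, for which C arithmetic wraps modulo 2³².

```c
#include <stdint.h>

/* Returns 1 if the arithmetic De Morgan-style identities hold for x and y. */
int extended_de_morgan_holds(uint32_t x, uint32_t y) {
    return ~(x + 1u) == ~x - 1u      /* not(x + 1) = not(x) - 1 */
        && ~(x - 1u) == ~x + 1u      /* not(x - 1) = not(x) + 1 */
        && ~(0u - x) == x - 1u       /* not(-x)    = x - 1      */
        && ~(x ^ y)  == (~x ^ y)     /* not(x xor y) = not(x) xor y */
        && ~(x + y)  == ~x - y       /* not(x + y) = not(x) - y */
        && ~(x - y)  == ~x + y;      /* not(x - y) = not(x) + y */
}
```

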
Right-to-Left Computability Test 


There is a simple test to determine whether or not a given 
function can be implemented with a sequence of add's, subtract's, 
and's, or's, and not's [War]. We can, of course, expand the list 
with other instructions that can be composed from the basic list, 
such as shift left by a fixed amount (which is equivalent to a 
sequence of add's), or multiply. However, we exclude instructions 
that cannot be composed from the list. The test is contained in 
the following theorem. 


THEOREM. A function mapping words to words can be 
implemented with word-parallel add, subtract, and, or, and 
not instructions if and only if each bit of the result depends 
only on bits at and to the right of each input operand. 


That is, imagine trying to compute the rightmost bit of the 
result by looking only at the rightmost bit of each input 
operand. Then, try to compute the next bit to the left by looking 
only at the rightmost two bits of each input operand, and 
continue in this way. If you are successful in this, then the 
function can be computed with a sequence of add's, and's, and so 
on. If the function cannot be computed in this right-to-left 
manner, then it cannot be implemented with a sequence of such 
instructions. 


The interesting part of this is the latter statement, and it is 
simply the contra-positive of the observation that the functions 
add, subtract, and, or, and not can all be computed in the right- 
to-left manner, so any combination of them must have this 
property. 

To see the “if” part of the theorem, we need a construction 
that is a little awkward to explain. We illustrate it with a specific 
example. Suppose that a function of two variables x and y has 
the right-to-left computability property, and suppose that bit 2 
of the result r is given by 


r₂ = x₂ | (x₀ & y₁).    (1)


We number bits from right to left, 0 to 31. Because bit 2 of the
result is a function of bits at and to the right of bit 2 of the input
operands, bit 2 of the result is “right-to-left computable.”


Arrange the computer words x, x shifted left two, and y
shifted left one, as shown below. Also, add a mask that isolates
bit 2.


x₃₁ x₃₀ ... x₃  x₂  x₁  x₀
x₂₉ x₂₈ ... x₁  x₀  0   0
y₃₀ y₂₉ ... y₂  y₁  y₀  0
0   0   ... 0   1   0   0
0   0   ... 0   r₂  0   0


Now, form the word-parallel and of rows 2 and 3, or the result
with row 1 (following Equation (1)), and then and the result with
the mask (row 4 above). The result is a word of all 0’s except for
the desired result bit in position 2. Perform similar computations
for the other bits of the result, or the 32 resulting words together,
and the result is the desired function.


This construction does not yield an efficient program; rather, 
it merely shows that it can be done with instructions in the basic 
list. 


Using the theorem, we immediately see that there is no 
sequence of such instructions that turns off the leftmost 1-bit in 
a word, because to see if a certain 1-bit should be turned off, we 
must look to the left to see if it is the leftmost one. Similarly, 
there can be no such sequence for performing a right shift, or a 
rotate shift, or a left shift by a variable amount, or for counting 
the number of trailing 0’s in a word (to count trailing 0’s, the
rightmost bit of the result will be 1 if there are an odd number
of trailing 0’s, and we must look to the left of the rightmost
position to determine that).


A Novel Application 


An application of the sort of bit twiddling discussed above is the 
problem of finding the next higher number after a given number 
that has the same number of 1-bits. You might very well wonder 
why anyone would want to compute that. It has application 
where bit strings are used to represent subsets. The possible 
members of a set are listed in a linear array, and a subset is 
represented by a word or sequence of words in which bit i is on 
if member i is in the subset. Set unions are computed by the 
logical or of the bit strings, intersections by and's, and so on. 


You might want to iterate through all the subsets of a given 
size. This is easily done if you have a function that maps a given 
subset to the next higher number (interpreting the subset string 
as an integer) with the same number of 1-bits. 


A concise algorithm for this operation was devised by R. W. 
Gosper [HAK, item 175].1 Given a word x that represents a 
subset, the idea is to find the rightmost contiguous group of 1’s 
in x and the following 0’s, and “increment” that quantity to the 
next value that has the same number of 1’s. For example, the
string xxx0 1111 0000, where xxx represents arbitrary bits,
becomes xxx1 0000 0111. The algorithm first identifies the
"smallest" 1-bit in x, with s = x & −x, giving 0000 0001 0000.
This is added to x, giving r = xxx1 0000 0000. The 1-bit here is
one bit of the result. For the other bits, we need to produce a
right-adjusted string of n − 1 1’s, where n is the size of the
rightmost group of 1’s in x. This can be done by first forming the
exclusive or of r and x, which gives 0001 1111 0000 in our
example.

This has two too many 1’s and needs to be right-adjusted. 
This can be accomplished by dividing it by s, which right-adjusts 
it (s is a power of 2), and shifting it right two more positions to 
discard the two unwanted bits. The final result is the or of this 
and r. 


In computer algebra notation, the result is y in


s ← x & −x
r ← s + x                                (2)
y ← r | (((x ⊕ r) >>ᵘ 2) ÷ᵘ s)

A complete C procedure is given in Figure 2-1. It executes in 


seven basic RISC instructions, one of which is division. (Do not 
use this procedure with x = 0; that causes division by 0.) 


If division is slow but you have a fast way to compute the
number of trailing zeros function ntz(x), the number of leading
zeros function nlz(x), or population count (pop(x) is the number
of 1-bits in x), then the last line of Equation (2) can be replaced
with one of the following formulas. (The first two methods can
fail on a machine that has modulo 32 shifts.)


y ← r | ((x ⊕ r) >>ᵘ (2 + ntz(x)))
y ← r | ((x ⊕ r) >>ᵘ (33 − nlz(s)))
y ← r | ((1 << (pop(x ⊕ r) − 2)) − 1)


unsigned snoob(unsigned x) {
   unsigned smallest, ripple, ones;
                                // x = xxx0 1111 0000
   smallest = x & -x;           //     0000 0001 0000
   ripple = x + smallest;       //     xxx1 0000 0000
   ones = x ^ ripple;           //     0001 1111 0000
   ones = (ones >> 2)/smallest; //     0000 0000 0111
   return ripple | ones;        //     xxx1 0000 0111
}


FIGURE 2-1. Next higher number with same number of 1-bits.
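
As a hedged illustration of the division-free variant, the sketch below
replaces the divide with a trailing-zero count and then uses the result to
step through all subsets of a given size. It is not from the original text:
ntz32, snoob_ntz, and count_subsets are hypothetical names, a simple loop
stands in for a hardware ntz instruction, and the shift is split in two so
that neither shift amount reaches 32.

```c
#include <stdint.h>

/* Number of trailing zeros; a loop stands in for a hardware instruction. */
unsigned ntz32(uint32_t x) {
    unsigned n = 0;
    if (x == 0) return 32;
    while ((x & 1u) == 0) { x >>= 1; n++; }
    return n;
}

/* Next higher number with the same number of 1-bits; x must be nonzero. */
uint32_t snoob_ntz(uint32_t x) {
    uint32_t s = x & (0u - x);   /* smallest 1-bit of x */
    uint32_t r = x + s;          /* ripple the carry through the group */
    /* Two separate shifts keep each shift amount below 32. */
    return r | (((x ^ r) >> 2) >> ntz32(x));
}

/* Count the k-element subsets of an n-element set, 0 < k <= n < 32. */
unsigned count_subsets(unsigned n, unsigned k) {
    uint32_t x = (1u << k) - 1u; /* lowest subset: k low-order 1-bits */
    uint32_t limit = 1u << n;
    unsigned count = 0;
    while (x < limit) { count++; x = snoob_ntz(x); }
    return count;
}
```

Starting from the word with k low-order 1-bits and applying the function
until bit n becomes set visits each k-element subset exactly once, in
increasing numeric order.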


2-2 Addition Combined with Logical Operations 


We assume the reader is familiar with the elementary identities
of ordinary algebra and Boolean algebra. Below is a selection of
similar identities involving addition and subtraction combined
with logical operations.


a.        −x = ¬x + 1
b.           = ¬(x − 1)
c.        ¬x = −x − 1
d.       −¬x = x + 1
e.       ¬−x = x − 1
f.     x + y = x − ¬y − 1
g.           = (x ⊕ y) + 2(x & y)
h.           = (x | y) + (x & y)
i.           = 2(x | y) − (x ⊕ y)
j.     x − y = x + ¬y + 1
k.           = (x ⊕ y) − 2(¬x & y)
l.           = (x & ¬y) − (¬x & y)
m.           = 2(x & ¬y) − (x ⊕ y)
n.     x ⊕ y = (x | y) − (x & y)
o.    x & ¬y = (x | y) − y
p.           = x − (x & y)
q.  ¬(x − y) = y − x − 1
r.           = ¬x + y
s.     x ≡ y = (x & y) − (x | y) − 1
t.           = (x & y) + ¬(x | y)
u.     x | y = (x & ¬y) + y
v.     x & y = (¬x | y) − ¬x


Equation (d) can be applied to itself repeatedly, giving −¬−¬x
= x + 2, and so on. Similarly, from (e) we have ¬−¬−x = x −
2. So we can add or subtract any constant using only the two
forms of complementation.


Equation (f) is the dual of (j), where (j) is the well-known 


relation that shows how to build a subtracter from an adder. 


Equations (g) and (h) are from HAKMEM memo [HAK, item 
23]. Equation (g) forms a sum by first computing the sum with 
carries ignored (x ⊕ y), and then adding in the carries. Equation
(h) is simply modifying the addition operands so that the 
combination 0 + 1 never occurs at any bit position; it is 
replaced with 1 + 0. 


It can be shown that in the ordinary addition of binary 
numbers with each bit independently equally likely to be 0 or 1, 
a carry occurs at each position with probability about 0.5. 
However, for an adder built by preconditioning the inputs using 
(g), the probability is about 0.25. This observation is probably 
not of value in building an adder, because for that purpose the 
important characteristic is the maximum number of logic circuits 
the carry must pass through, and using (g) reduces the number 
of stages the carry propagates through by only one. 


Equations (k) and (l) are duals of (g) and (h), for subtraction.
That is, (k) has the interpretation of first forming the difference
ignoring the borrows (x ⊕ y), and then subtracting the borrows.
Similarly, Equation (l) is simply modifying the subtraction
operands so that the combination 1 − 1 never occurs at any bit
position; it is replaced with 0 − 0.


Equation (n) shows how to implement exclusive or in only
three instructions on a basic RISC. Using only and-or-not logic
requires four instructions ((x | y) & ¬(x & y)). Similarly, (u) and
(v) show how to implement and and or in three other elementary
instructions, whereas using De Morgan's laws requires four.
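
A few of these identities can be checked directly in C. The following
sketch is illustrative only; the function name is hypothetical, and 32-bit
unsigned arithmetic (which wraps modulo 2³²) is assumed.

```c
#include <stdint.h>

/* Returns 1 if identities (g), (h), (k), (l), (n), (u), (v) hold for x, y. */
int add_logic_identities_hold(uint32_t x, uint32_t y) {
    return x + y == (x ^ y) + 2u * (x & y)     /* (g) */
        && x + y == (x | y) + (x & y)          /* (h) */
        && x - y == (x ^ y) - 2u * (~x & y)    /* (k) */
        && x - y == (x & ~y) - (~x & y)        /* (l) */
        && (x ^ y) == (x | y) - (x & y)        /* (n) */
        && (x | y) == (x & ~y) + y             /* (u) */
        && (x & y) == (~x | y) - ~x;           /* (v) */
}
```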


2-3 Inequalities among Logical and Arithmetic 
Expressions 


Inequalities among binary logical expressions whose values are
interpreted as unsigned integers are nearly trivial to derive. Here
are two examples:


(x ⊕ y) ≤ (x | y), and
(x & y) ≤ (x ≡ y).


These can be derived from a list of all binary logical operations,
shown in Table 2-1.


TABLE 2-1. THE 16 BINARY LOGICAL OPERATIONS 


Let f(x, y) and g(x, y) represent two columns in Table 2-1. If
for each row in which f(x, y) is 1, g(x, y) also is 1, then for all
(x, y), f(x, y) ≤ g(x, y). Clearly, this extends to word-parallel
logical operations. One can easily read off such relations (most
of which are trivial) as (x & y) ≤ x ≤ (x | ¬y), and so on.
Furthermore, if two columns have a row in which one entry is 0
and the other is 1, and another row in which the entries are 1
and 0, respectively, then no inequality relation exists between
the corresponding logical expressions. So the question of
whether or not f(x, y) ≤ g(x, y) is completely and easily solved
for all binary logical functions f and g.


Use caution when manipulating these relations. For example,
for ordinary arithmetic, if x + y ≤ a and z ≤ x, then z + y ≤
a, but this inference is not valid if “+” is replaced with or.


Inequalities involving mixed logical and arithmetic
expressions are more interesting. Below is a small selection.


a. (x | y) ≥ max(x, y)
b. (x & y) ≤ min(x, y)
c. (x | y) ≤ x + y, if the addition does not overflow
d. (x | y) > x + y, if the addition overflows
e. |x − y| ≤ (x ⊕ y)


The proofs of these are quite simple, except possibly for the
relation |x − y| ≤ (x ⊕ y). By |x − y| we mean the absolute
value of x − y, which can be computed within the domain of
unsigned numbers as max(x, y) − min(x, y). This relation can
be proven by induction on the length of x and y (the proof is a
little easier if you extend them on the left rather than on the
right).
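
Relations (a) through (e) can also be confirmed by brute force on a short
word length. The sketch below is an illustration under stated assumptions:
it uses 8-bit words held in wider variables so that overflow can be
observed, and the function name is hypothetical.

```c
/* Exhaustively checks relations (a)-(e) on 8-bit unsigned words.
   Returns 1 if every relation holds for every pair (x, y). */
int mixed_inequalities_hold_8bit(void) {
    for (unsigned x = 0; x < 256; x++) {
        for (unsigned y = 0; y < 256; y++) {
            unsigned mx = x > y ? x : y;
            unsigned mn = x < y ? x : y;
            unsigned sum = (x + y) & 0xFFu;     /* 8-bit wrapped sum */
            int overflow = (x + y) > 0xFFu;     /* carry out of bit 7 */
            if ((x | y) < mx) return 0;                 /* (a) */
            if ((x & y) > mn) return 0;                 /* (b) */
            if (!overflow && (x | y) > sum) return 0;   /* (c) */
            if (overflow && (x | y) <= sum) return 0;   /* (d) */
            if (mx - mn > (x ^ y)) return 0;            /* (e) */
        }
    }
    return 1;
}
```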


2-4 Absolute Value Function 


If your machine does not have an instruction for computing the
absolute value, this computation can usually be done in three or
four branch-free instructions. First, compute y ← x >>ˢ 31, and
then one of the following:


abs                nabs
(x ⊕ y) − y        y − (x ⊕ y)
(x + y) ⊕ y        (y − x) ⊕ y
x − (2x & y)       (2x & y) − x


By “2x” we mean, of course, x + x or x << 1.


If you have fast multiplication by a variable whose value is
±1, the following will do:


((x >>ˢ 30) | 1) * x


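
As a hedged sketch of the first abs and nabs formulas in C: because right
shift of a negative signed value is implementation-defined in C, the mask y
is formed here as −(x >>ᵘ 31) on the unsigned bit pattern, which yields the
same all-1's or all-0's word as a signed shift by 31. The function names
are hypothetical, and results are returned as 32-bit patterns.

```c
#include <stdint.h>

/* abs: (x XOR y) - y, where y is all 1's if x is negative, else 0. */
uint32_t abs32(uint32_t x) {
    uint32_t y = 0u - (x >> 31);
    return (x ^ y) - y;
}

/* nabs (negative of the absolute value): y - (x XOR y). */
uint32_t nabs32(uint32_t x) {
    uint32_t y = 0u - (x >> 31);
    return y - (x ^ y);
}
```

Note that abs32 maps the maximum negative number 0x80000000 to itself,
which is the usual wraparound behavior of a two's-complement abs.
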
2-5 Average of Two Integers


The following formula can be used to compute the average of
two unsigned integers, ⌊(x + y)/2⌋, without causing overflow
[Dietz]:


(x & y) + ((x ⊕ y) >>ᵘ 1)    (3)


The formula below computes ⌈(x + y)/2⌉ for unsigned integers:


(x | y) − ((x ⊕ y) >>ᵘ 1)


To compute the same quantities (“floor and ceiling averages”)
for signed integers, use the same formulas, but with the
unsigned shift replaced with a signed shift.


For signed integers, one might also want the average with the 
division by 2 rounded toward 0. Computing this “truncated 
average” (without causing overflow) is a little more difficult. It 
can be done by computing the floor average and then correcting 


it. The correction is to add 1 if, arithmetically, x + y is negative
and odd. But x + y is negative if and only if the result of (3),
with the unsigned shift replaced with a signed shift, is negative.
This leads to the following method (seven instructions on the
basic RISC, after commoning the subexpression x ⊕ y):


t ← (x & y) + ((x ⊕ y) >>ˢ 1);
t + ((t >>ᵘ 31) & (x ⊕ y))


Some common special cases can be done more efficiently. If x
and y are signed integers and known to be nonnegative, then the
average can be computed as simply (x + y) >>ᵘ 1. The sum can
overflow, but the overflow bit is retained in the register that
holds the sum, so that the unsigned shift moves the overflow bit
to the proper position and supplies a zero sign bit.


If x and y are unsigned integers and x ≤ᵘ y, or if x and y are
signed integers and x ≤ y (signed comparison), then the average
is given by x + ((y − x) >>ᵘ 1). These are floor averages; for
example, the average of −1 and 0 is −1.
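
The unsigned floor and ceiling averages are easy to try out in C. This is
a minimal sketch with hypothetical function names, assuming 32-bit
unsigned operands.

```c
#include <stdint.h>

/* Floor of (x + y)/2 without overflow: (x & y) + ((x ^ y) >> 1). */
uint32_t avg_floor_u(uint32_t x, uint32_t y) {
    return (x & y) + ((x ^ y) >> 1);
}

/* Ceiling of (x + y)/2 without overflow: (x | y) - ((x ^ y) >> 1). */
uint32_t avg_ceil_u(uint32_t x, uint32_t y) {
    return (x | y) - ((x ^ y) >> 1);
}
```

Both functions give correct results even when x + y would not fit in 32
bits, which is the case the naive (x + y) / 2 gets wrong.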


2-6 Sign Extension 


By "sign extension," we mean to consider a certain bit position 
in a word to be the sign bit, and we wish to propagate that to 
the left, ignoring any other bits present. The standard way to do 
this is with shift left logical followed by shift right signed. 
However, if these instructions are slow or nonexistent on your 
machine, it can be done with one of the following, where we 
illustrate by propagating bit position 7 to the left: 


((x + 0x00000080) & 0x000000FF) − 0x00000080
((x & 0x000000FF) ⊕ 0x00000080) − 0x00000080
(x & 0x0000007F) − (x & 0x00000080)


The “+” above can also be “−” or “⊕.” The second formula is
particularly useful if you know that the unwanted high-order
bits are all 0’s, because then the and can be omitted.
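
The first two sign-extension formulas can be written in C as follows. This
is a sketch with hypothetical function names; the word is treated as a
32-bit unsigned pattern so that the subtraction wraps.

```c
#include <stdint.h>

/* Propagate bit 7 to the left: ((x + 0x80) & 0xFF) - 0x80. */
uint32_t sign_extend8_add(uint32_t x) {
    return ((x + 0x00000080u) & 0x000000FFu) - 0x00000080u;
}

/* Propagate bit 7 to the left: ((x & 0xFF) ^ 0x80) - 0x80. */
uint32_t sign_extend8_xor(uint32_t x) {
    return ((x & 0x000000FFu) ^ 0x00000080u) - 0x00000080u;
}
```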


2-7 Shift Right Signed from Unsigned 


If your machine does not have the shift right signed instruction, it
can be computed using the formulas shown below. The first
formula is from [GM], and the second is based on the same idea.
These formulas hold for 0 ≤ n ≤ 31 and, if the machine has
mod-64 shifts, the last holds for 0 ≤ n ≤ 63. The last formula
holds for any n if by “holds” we mean “treats the shift amount to
the same modulus as does the logical shift.”


When n is a variable, each formula requires five or six
instructions on a basic RISC.


((x + 0x80000000) >>ᵘ n) − (0x80000000 >>ᵘ n)
t ← 0x80000000 >>ᵘ n; ((x >>ᵘ n) ⊕ t) − t
t ← (x & 0x80000000) >>ᵘ n; (x >>ᵘ n) − (t + t)
(x >>ᵘ n) | (−(x >>ᵘ 31) << (31 − n))
t ← −(x >>ᵘ 31); ((x ⊕ t) >>ᵘ n) ⊕ t


In the first two formulas, an alternative for the expression
0x80000000 >>ᵘ n is 1 << (31 − n).


If n is a constant, the first two formulas require only three
instructions on many machines. If n = 31, the function can be
done in two instructions with −(x >>ᵘ 31).


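
Two of these constructions are sketched below in C, where they are handy
because right-shifting a negative signed value is implementation-defined.
The function names are hypothetical; x is the 32-bit pattern to be shifted
and n is assumed to lie in the range 0 to 31.

```c
#include <stdint.h>

/* Shift right signed via t = 0x80000000 >> n; ((x >> n) ^ t) - t. */
uint32_t sra32(uint32_t x, unsigned n) {
    uint32_t t = 0x80000000u >> n;
    return ((x >> n) ^ t) - t;
}

/* Shift right signed via t = -(x >> 31); ((x ^ t) >> n) ^ t. */
uint32_t sra32_alt(uint32_t x, unsigned n) {
    uint32_t t = 0u - (x >> 31);   /* all 1's if x is negative */
    return ((x ^ t) >> n) ^ t;
}
```
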
2-8 Sign Function 


The sign, or signum, function is defined by


          { −1,  x < 0,
sign(x) = {  0,  x = 0,
          {  1,  x > 0.


It can be calculated with four instructions on most machines
[Hop]:


(x >>ˢ 31) | (−x >>ᵘ 31)


If you don't have shift right signed, then use the substitute
noted at the end of Section 2-7, giving the following nicely
symmetric formula (five instructions):


−(x >>ᵘ 31) | (−x >>ᵘ 31)


Comparison predicate instructions permit a three-instruction
solution, with either


(x > 0) − (x < 0), or
(x ≥ 0) − (x ≤ 0).    (4)


Finally, we note that the formula (−x >>ᵘ 31) − (x >>ᵘ 31)
almost works; it fails only for x = −2³¹.
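
Both styles of sign computation are sketched below in C. The names are
hypothetical. The shift version works on the 32-bit pattern of x and
returns the pattern of −1, 0, or 1; the predicate version relies on C
comparisons yielding 0 or 1.

```c
#include <stdint.h>

/* sign(x) from comparison predicates: (x > 0) - (x < 0). */
int sign_cmp(int32_t x) {
    return (x > 0) - (x < 0);
}

/* sign(x) from the symmetric shift formula, -(x >> 31) | (-x >> 31),
   using unsigned (logical) shifts on the bit pattern of x. */
uint32_t sign_shift(uint32_t x) {
    return (0u - (x >> 31)) | ((0u - x) >> 31);
}
```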


2-9 Three-Valued Compare Function 


The three-valued compare function, a slight generalization of the
sign function, is defined by


            { −1,  x < y,
cmp(x, y) = {  0,  x = y,
            {  1,  x > y.


There are both signed and unsigned versions, and unless
otherwise specified, this section applies to both.


Comparison predicate instructions permit a three-instruction
solution, an obvious generalization of Equations in (4):


(x > y) − (x < y), or
(x ≥ y) − (x ≤ y).


A solution for unsigned integers on PowerPC is shown below 
[CWG]. On this machine, “carry” is “not borrow." 


subf  R5,Ry,Rx   # R5 <-- Rx - Ry.
subfc R6,Rx,Ry   # R6 <-- Ry - Rx, set carry.
subfe R7,Ry,Rx   # R7 <-- Rx - Ry + carry, set carry.
subfe R8,R7,R5   # R8 <-- R5 - R7 + carry, (set carry).


If limited to the instructions of the basic RISC, there does not
seem to be any particularly good way to compute this function.
The comparison predicates x < y, x ≤ y, and so on, require
about five instructions (see Section 2-12), leading to a solution
in about 12 instructions (using a small amount of commonality
in computing x < y and x > y). On the basic RISC it's probably
preferable to use compares and branches (six instructions
executed worst case if compares can be commoned).
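
In C, the comparison-predicate solution is a one-liner, since relational
operators evaluate to 0 or 1. The sketch below uses hypothetical names and
shows the signed and unsigned versions side by side.

```c
#include <stdint.h>

/* Three-valued compare, signed: (x > y) - (x < y). */
int cmp_signed(int32_t x, int32_t y) {
    return (x > y) - (x < y);
}

/* Three-valued compare, unsigned: same formula on unsigned operands. */
int cmp_unsigned(uint32_t x, uint32_t y) {
    return (x > y) - (x < y);
}
```

The same bit pattern can compare differently in the two versions: all 1's
is −1 when signed but the largest value when unsigned.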


2-10 Transfer of Sign Function 

The transfer of sign function, called ISIGN in Fortran, is defined
by


              {  abs(x),  y ≥ 0,
ISIGN(x, y) = {
              { −abs(x),  y < 0.


This function can be calculated (modulo 2³²) with four
instructions on most machines:


t ← y >>ˢ 31;                       t ← (x ⊕ y) >>ˢ 31;
ISIGN(x, y) = (abs(x) ⊕ t) − t      ISIGN(x, y) = (x ⊕ t) − t
            = (abs(x) + t) ⊕ t                  = (x + t) ⊕ t
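
The second (abs-free) form can be sketched in C as follows. The function
name is hypothetical, operands are 32-bit patterns, and the signed shift
by 31 is written as −(… >>ᵘ 31) to stay within well-defined C.

```c
#include <stdint.h>

/* ISIGN(x, y) modulo 2^32: t = (x ^ y) >>s 31; result = (x ^ t) - t. */
uint32_t isign32(uint32_t x, uint32_t y) {
    uint32_t t = 0u - ((x ^ y) >> 31);  /* all 1's if the signs differ */
    return (x ^ t) - t;                 /* negate x exactly when t is all 1's */
}
```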


2-11 Decoding a “Zero Means 2**n” Field 


Sometimes a 0 or negative value does not make much sense for a
quantity, so it is encoded in an n-bit field with a 0 value being
understood to mean 2ⁿ, and a nonzero value having its normal
binary interpretation. An example is the length field of
PowerPC's load string word immediate (lswi) instruction, which
occupies five bits. It is not useful to have an instruction that
loads zero bytes when the length is an immediate quantity, but
it is definitely useful to be able to load 32 bytes. The length field
could be encoded with values from 0 to 31 denoting lengths
from 1 to 32, but the “zero means 32” convention results in
simpler logic when the processor must also support a
corresponding instruction with a variable (in-register) length
that employs straight binary encoding (e.g., PowerPC's lswx
instruction).


It is trivial to encode an integer in the range 1 to 2ⁿ into the
“zero means 2ⁿ” encoding: simply mask the integer with 2ⁿ −
1. To do the decoding without a test-and-branch is not quite as
simple, but here are some possibilities, illustrated for a 3-bit
field. They all require three instructions, not counting possible
loads of constants.


((x − 1) & 7) + 1      ((x + 7) | −8) + 9     8 − (−x & 7)
((x + 7) & 7) + 1      ((x + 7) | 8) − 7      −(−x | −8)
((x − 1) | −8) + 9     ((x − 1) & 8) + x

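
The encoding and one of the decoding formulas are sketched below in C for
the 3-bit case. The function names are hypothetical, and the field value
is assumed to be already isolated in the range 0 to 7.

```c
/* Decode a 3-bit "zero means 8" field: 0 -> 8, 1..7 unchanged.
   Uses ((x - 1) & 7) + 1; the unsigned subtraction wraps for x == 0. */
unsigned decode_zero_means_8(unsigned field) {
    return ((field - 1u) & 7u) + 1u;
}

/* Encode a value in the range 1..8 by masking with 2^3 - 1. */
unsigned encode_zero_means_8(unsigned value) {
    return value & 7u;
}
```
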

2-12 Comparison Predicates 


A "comparison predicate" is a function that compares two 
quantities, producing a single bit result of 1 if the comparison is 
true, and 0 if the comparison is false. Below we show branch- 
free expressions to evaluate the result into the sign position. To 
produce the 1/0 value used by some languages (e.g., C), follow 
the code with a shift right of 31. To produce the −1/0 result
used by some other languages (e.g., Basic), follow the code with 
a shift right signed of 31. 


These formulas are, of course, not of interest on machines 
such as MIPS and our model RISC, which have comparison 
instructions that compute many of these predicates directly, 
placing a 0/1-valued result in a general purpose register. 


x = y:   abs(x − y) − 1
         abs(x − y + 0x80000000)
         nlz(x − y) << 26
         −(nlz(x − y) >>ᵘ 5)
         ¬(x − y | y − x)
x ≠ y:   nabs(x − y)
         nlz(x − y) − 32
         x − y | y − x
x < y:   (x − y) ⊕ [(x ⊕ y) & ((x − y) ⊕ x)]
         (x & ¬y) | ((x ≡ y) & (x − y))
         nabs(doz(y, x))   [GSO]
x ≤ y:   (x | ¬y) & ((x ⊕ y) | ¬(y − x))
         ((x ≡ y) >>ˢ 1) + (x & ¬y)   [GSO]
x <ᵘ y:  (¬x & y) | ((x ≡ y) & (x − y))
         (¬x & y) | ((¬x | y) & (x − y))
x ≤ᵘ y:  (¬x | y) & ((x ⊕ y) | ¬(y − x))


A machine instruction that computes the negative of the 
absolute value is handy here. We show this function as “nabs.” 
Unlike absolute value, it is well defined in that it never 
overflows. Machines that do not have nabs, but have the more 
usual abs, can use —abs(x) for nabs(x). If x is the maximum 
negative number, this overflows twice, but the result is correct. 
(We assume that the absolute value and the negation of the 
maximum negative number is itself.) Because some machines 
have neither abs nor nabs, we give an alternative that does not 
use them. 


The “nlz” function is the number of leading 0’s in its
argument. The “doz” function (difference or zero) is described on
page 41. For x > y, x ≥ y, and so on, interchange x and y in
the formulas for x < y, x ≤ y, and so on. The add of
0x80000000 can be replaced with any instruction that inverts the
high-order bit (in x, y, or x − y).


Another class of formulas can be derived from the
observation that the predicate x < y is given by the sign of x/2
− y/2, and the subtraction in that expression cannot overflow.
The result can be fixed up by subtracting 1 in the cases in which
the shifts discard essential information, as follows:


x < y:   (x >>ˢ 1) − (y >>ˢ 1) − (¬x & y & 1)
x <ᵘ y:  (x >>ᵘ 1) − (y >>ᵘ 1) − (¬x & y & 1)


These execute in seven instructions on most machines (six if it
has and not), which is no better than what we have above (five
to seven instructions, depending upon the fullness of the set of
logic instructions).
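
As an illustration, the sign-position formulas for “less than” can be
written in C and followed by a logical shift right of 31 to obtain a 1/0
result. This is a sketch with hypothetical names; x ≡ y is written as
~(x ^ y), and operands are 32-bit patterns.

```c
#include <stdint.h>

/* 1 if x < y as signed 32-bit integers, else 0:
   (x & ~y) | ((x == y bitwise) & (x - y)), taken from the sign position. */
uint32_t lt_signed(uint32_t x, uint32_t y) {
    return ((x & ~y) | (~(x ^ y) & (x - y))) >> 31;
}

/* 1 if x < y as unsigned 32-bit integers, else 0:
   (~x & y) | ((x == y bitwise) & (x - y)), taken from the sign position. */
uint32_t lt_unsigned(uint32_t x, uint32_t y) {
    return ((~x & y) | (~(x ^ y) & (x - y))) >> 31;
}
```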


The formulas above involving nlz are due to [Shep], and his
formula for the x = y predicate is particularly useful, because a
minor variation of it gets the predicate evaluated to a 1/0-
valued result with only three instructions:


nlz(x − y) >>ᵘ 5


Signed comparisons to 0 are frequent enough to deserve 
special mention. There are some formulas for these, mostly 
derived directly from the above. Again, the result is in the sign 
position. 


x = 0:  abs(x) − 1
        abs(x + 0x80000000)
        nlz(x) << 26
        −(nlz(x) >>ᵘ 5)
        ¬(x | −x)
        ¬x & (x − 1)
x ≠ 0:  nabs(x)
        nlz(x) − 32
        x | −x
        (x >>ᵘ 1) − x   [CWG]
x < 0:  x
x ≤ 0:  x | (x − 1)
        x | ¬−x
x > 0:  x ⊕ nabs(x)
        (x >>ˢ 1) − x
        −x & ¬x
x ≥ 0:  ¬x


Signed comparisons can be obtained from their unsigned
counterparts by biasing the signed operands upward by 2³¹ and
interpreting the results as unsigned integers. The reverse
transformation also works.2 Thus, we have


x < y  = x + 2³¹ <ᵘ y + 2³¹,
x <ᵘ y = x − 2³¹ < y − 2³¹.


Similar relations hold for ≤, ≤ᵘ, and so on. In these relations,
one can use addition, subtraction, or exclusive or with 2³¹. They
are all equivalent, as they simply invert the sign bit. An
instruction like the basic RISC's add immediate shifted is useful to
avoid loading the constant 2³¹.


Another way to get signed comparisons from unsigned is
based on the fact that if x and y have the same sign, then
x < y = x <ᵘ y, whereas if they have opposite signs, then
x < y = x >ᵘ y [Lamp]. Again, the reverse transformation also
works, so we have


x < y  = (x <ᵘ y) ⊕ x₃₁ ⊕ y₃₁, and
x <ᵘ y = (x < y) ⊕ x₃₁ ⊕ y₃₁,


where x₃₁ and y₃₁ are the sign bits of x and y, respectively.
Similar relations hold for ≤, ≤ᵘ, and so on.

Using either of these devices enables computing all the usual
comparison predicates other than = and ≠ in terms of any one
of them, with at most three additional instructions on most
machines. For example, let us take x ≤ᵘ y as primitive, because it
is one of the simplest to implement (it is the carry bit from y −
x). Then the other predicates can be obtained as follows:


x < y  = ¬(y + 2³¹ ≤ᵘ x + 2³¹)
x ≤ y  = x + 2³¹ ≤ᵘ y + 2³¹
x > y  = ¬(x + 2³¹ ≤ᵘ y + 2³¹)
x ≥ y  = y + 2³¹ ≤ᵘ x + 2³¹
x <ᵘ y = ¬(y ≤ᵘ x)
x >ᵘ y = ¬(x ≤ᵘ y)
x ≥ᵘ y = y ≤ᵘ x


Comparison Predicates from the Carry Bit 


If the machine can easily deliver the carry bit into a general 
purpose register, this may permit concise code for some of the 


comparison predicates. Below are several of these relations. The
notation carry(expression) means the carry bit generated by the
outermost operation in expression. We assume the carry bit for
the subtraction x − y is what comes out of the adder for x + ¬y
+ 1, which is the complement of “borrow.”


x = y:   carry(0 − (x − y)), or carry((x + ¬y) + 1), or
         carry((x − y − 1) + 1)
x ≠ y:   carry((x − y) − 1), i.e., carry((x − y) + (−1))
x < y:   ¬carry((x + 2³¹) − (y + 2³¹)), or ¬carry(x − y) ⊕ x₃₁ ⊕ y₃₁
x ≤ y:   carry((y + 2³¹) − (x + 2³¹)), or carry(y − x) ⊕ x₃₁ ⊕ y₃₁
x <ᵘ y:  ¬carry(x − y)
x ≤ᵘ y:  carry(y − x)
x = 0:   carry(0 − x), or carry(¬x + 1)
x ≠ 0:   carry(x − 1), i.e., carry(x + (−1))
x < 0:   carry(x + x)
x ≤ 0:   carry(2³¹ − (x + 2³¹))


For x > y, use the complement of the expression for x ≤ y, and
similarly for other relations involving “greater than.”


The GNU Superoptimizer has been applied to the problem of 
computing predicate expressions on the IBM RS/6000 computer 
and its close relative PowerPC [GK]. The RS/6000 has 
instructions for abs(x), nabs(x), doz(x, y), and a number of 
forms of add and subtract that use the carry bit. It was found that 
the RS/6000 can compute all the integer predicate expressions 
with three or fewer elementary (one-cycle) instructions, a result 
that surprised even the architects of the machine. “All” includes 
the six two-operand signed comparisons and the four two- 
operand unsigned comparisons, all of these with the second 
operand being 0, and all in forms that produce a 1/0 result or a
−1/0 result. PowerPC, which lacks abs(x), nabs(x), and doz(x,
y), can compute all the predicate expressions in four or fewer 
elementary instructions. 


How the Computer Sets the Comparison Predicates 


Most computers have a way of evaluating the integer 
comparison predicates to a 1-bit result. The result bit may be 
placed in a “condition register” or, for some machines (such as 


our RISC model), in a general purpose register. In either case, 
the facility is often implemented by subtracting the comparison 
operands and then performing a small amount of logic on the 
result bits to determine the 1-bit comparison result. 


Below is the logic for these operations. It is assumed that the
machine computes x − y as x + ¬y + 1, and the following
quantities are available in the result:


Cₒ, the carry out of the high-order position
Cᵢ, the carry into the high-order position
N, the sign bit of the result
Z, which equals 1 if the result, exclusive of Cₒ, is all-0, and
is otherwise 0


Then we have the following in Boolean algebra notation
(juxtaposition denotes and, + denotes or, ¬ denotes not):


V:       Cᵢ ⊕ Cₒ   (signed overflow)
x = y:   Z
x ≠ y:   ¬Z
x < y:   N ⊕ V
x ≤ y:   (N ⊕ V) + Z
x > y:   (N ≡ V) ¬Z
x ≥ y:   N ≡ V
x <ᵘ y:  ¬Cₒ
x ≤ᵘ y:  ¬Cₒ + Z
x >ᵘ y:  Cₒ ¬Z
x ≥ᵘ y:  Cₒ


2-13 Overflow Detection 


“Overflow” means that the result of an arithmetic operation is 
too large or too small to be correctly represented in the target 
register. This section discusses methods that a programmer 
might use to detect when overflow has occurred, without using 


the machine's "status bits" that are often supplied expressly for 
this purpose. This is important, because some machines do not 
have such status bits (e.g., MIPS), and even if the machine is so 
equipped, it is often difficult or impossible to access the bits 
from a high-level language. 


Signed Add/Subtract 


When overflow occurs on integer addition and subtraction, 
contemporary machines invariably discard the high-order bit of 
the result and store the low-order bits that the adder naturally 
produces. Signed integer overflow of addition occurs if and only 
if the operands have the same sign and the sum has a sign 
opposite to that of the operands. Surprisingly, this same rule 
applies even if there is a carry into the adder—that is, if the 
calculation is x + y + 1. This is important for the application of 
adding multiword signed integers, in which the last addition is a 
signed addition of two fullwords and a carry-in that may be 0 or 
+1. 


To prove the rule for addition, let x and y denote the values
of the one-word signed integers being added, let c (carry-in) be 0
or 1, and assume for simplicity a 4-bit machine. Then if the signs
of x and y are different,


−8 ≤ x ≤ −1, and
 0 ≤ y ≤ 7,


or similar bounds apply if x is nonnegative and y is negative. In
either case, by adding these inequalities and optionally adding
in 1 for c,


−8 ≤ x + y + c ≤ 7.


This is representable as a 4-bit signed integer, and thus overflow
does not occur when the operands have opposite signs.


Now suppose x and y have the same sign. There are two
cases:


(a)                        (b)
−8 ≤ x ≤ −1                0 ≤ x ≤ 7
−8 ≤ y ≤ −1                0 ≤ y ≤ 7


Thus,


(a)                        (b)
−16 ≤ x + y + c ≤ −1       0 ≤ x + y + c ≤ 15


Overflow occurs if the sum is not representable as a 4-bit signed
integer; that is, if


(a)                        (b)
−16 ≤ x + y + c ≤ −9       8 ≤ x + y + c ≤ 15.


In case (a), this is equivalent to the high-order bit of the 4-bit 
sum being 0, which is opposite to the sign of x and y. In case 
(b), this is equivalent to the high-order bit of the 4-bit sum being 
1, which again is opposite to the sign of x and y. 


For subtraction of multiword integers, the computation of
interest is x − y − c, where again c is 0 or 1, with a value of 1
representing a borrow-in. From an analysis similar to the above,
it can be seen that overflow in the final value of x − y − c
occurs if and only if x and y have opposite signs and the sign of
x − y − c is opposite to that of x (or, equivalently, the same as
that of y).


This leads to the following expressions for the overflow
predicate, with the result being in the sign position. Following
these with a shift right or shift right signed of 31 produces a 1/0-
or a −1/0-valued result.


x + y + c                                x − y − c
(x ≡ y) & ((x + y + c) ⊕ x)              (x ⊕ y) & ((x − y − c) ⊕ x)
((x + y + c) ⊕ x) & ((x + y + c) ⊕ y)    ((x − y − c) ⊕ x) & ((x − y − c) ≡ y)


By choosing the second alternative in the first column, and the
first alternative in the second column (avoiding the equivalence
operation), our basic RISC can evaluate these tests with three
instructions in addition to those required to compute x + y + c
or x − y − c. A fourth instruction (branch if negative) can be
added to branch to code where the overflow condition is
handled.
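
The two equivalence-free tests can be sketched in C as follows. The
function names are hypothetical; operands are 32-bit patterns of signed
values, the arithmetic is done in unsigned so that it wraps, and the
result is shifted right 31 to give 1 or 0.

```c
#include <stdint.h>

/* 1 if the signed addition x + y + c overflows (c is 0 or 1). */
uint32_t add_overflows(uint32_t x, uint32_t y, uint32_t c) {
    uint32_t s = x + y + c;              /* wrapped sum */
    return ((s ^ x) & (s ^ y)) >> 31;    /* sum's sign differs from both */
}

/* 1 if the signed subtraction x - y - c overflows (c is 0 or 1). */
uint32_t sub_overflows(uint32_t x, uint32_t y, uint32_t c) {
    uint32_t d = x - y - c;              /* wrapped difference */
    return ((x ^ y) & (d ^ x)) >> 31;    /* signs differ, result flips x's */
}
```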


If executing with overflow interrupts enabled, the
programmer may wish to test to see if a certain addition or
subtraction will cause overflow, in a way that does not cause it.
One branch-free way to do this is as follows:


x + y + c                          x − y − c
z ← (x ≡ y) & 0x80000000           z ← (x ⊕ y) & 0x80000000
z & (((x ⊕ z) + y + c) ≡ y)        z & (((x ⊕ z) − y − c) ⊕ y)


The assignment to z in the left column sets z = 0x80000000 if
x and y have the same sign, and sets z = 0 if they differ. Then,
the addition in the second expression is done with x ⊕ z and y
having different signs, so it can't overflow. If x and y are
nonnegative, the sign bit in the second expression will be 1 if
and only if (x − 2³¹) + y + c ≥ 0; that is, iff x + y + c ≥
2³¹, which is the condition for overflow in evaluating x + y +
c. If x and y are negative, the sign bit in the second expression
will be 1 iff (x + 2³¹) + y + c < 0; that is, iff x + y + c <
−2³¹, which again is the condition for overflow. The and with z
ensures the correct result (0 in the sign position) if x and y have
opposite signs. Similar remarks apply to the case of subtraction
(right column). The code executes in nine instructions on the
basic RISC.
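
The addition pre-test can be sketched in C as shown below. The function
name is hypothetical. Unsigned arithmetic in C never traps, so the sketch
only demonstrates that the formula gives the right answer; the point of
the formula on a trapping machine is that the internal addition is always
between operands of opposite sign.

```c
#include <stdint.h>

/* 1 if the signed sum x + y + c would overflow (c is 0 or 1), computed
   without ever adding two operands of the same sign. */
uint32_t add_would_overflow(uint32_t x, uint32_t y, uint32_t c) {
    uint32_t z = ~(x ^ y) & 0x80000000u;         /* sign bit set iff same signs */
    return (z & ~(((x ^ z) + y + c) ^ y)) >> 31; /* equivalence written as ~(a ^ b) */
}
```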


It might seem that if the carry from addition is readily 
available, this might help in computing the signed overflow 
predicate. This does not seem to be the case; however, one 
method along these lines is as follows. 


If x is a signed integer, then x + 2³¹ is correctly represented
as an unsigned number and is obtained by inverting the
high-order bit of x. Signed overflow in the positive direction occurs
if x + y ≥ 2³¹, that is, if (x + 2³¹) + (y + 2³¹) ≥ 3·2³¹. This
latter condition is characterized by carry occurring in the
unsigned add (which means that the sum is greater than or equal
to 2³²) and the high-order bit of the sum being 1. Similarly,
overflow in the negative direction occurs if the carry is 0 and the
high-order bit of the sum is also 0.


This gives the following algorithm for detecting overflow for 
signed addition: 


Compute (x ⊕ 2^31) + (y ⊕ 2^31), giving sum s and carry c.
Overflow occurred iff c equals the high-order bit of s.


The sum is the correct sum for the signed addition, because 
inverting the high-order bits of both operands does not change 
their sum. 


For subtraction, the algorithm is the same except that in the
first step a subtraction replaces the addition. We assume that the
carry is that which is generated by computing x − y as x + ¬y
+ 1. The result is the correct difference for the signed
subtraction.


These formulas are perhaps interesting, but on most machines
they would not be quite as efficient as the formulas that do not
even use the carry bit (e.g., overflow = (x ≡ y) & (s ⊕ x) for
addition, and (x ⊕ y) & (d ⊕ x) for subtraction, where s and d
are the sum and difference, respectively, of x and y).
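
As a rough C illustration of the two carry-free formulas just quoted (function names are hypothetical; the wrapped sum is formed in unsigned arithmetic, and the conversion back to int32_t is assumed to be the usual two's-complement one):

```c
#include <stdint.h>

/* Wrapping signed add; *ovf is set to 1 on signed overflow, else 0.
   The predicate is the sign bit of (x == y) & (s ^ x), where "=="
   is bitwise equivalence, written here as ~(x ^ y). */
int32_t add_check(int32_t x, int32_t y, int *ovf)
{
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t s = ux + uy;                        /* wraps modulo 2^32 */
    *ovf = (int)((~(ux ^ uy) & (s ^ ux)) >> 31);
    return (int32_t)s;
}

/* Wrapping signed subtract; the predicate is (x ^ y) & (d ^ x). */
int32_t sub_check(int32_t x, int32_t y, int *ovf)
{
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t d = ux - uy;
    *ovf = (int)(((ux ^ uy) & (d ^ ux)) >> 31);
    return (int32_t)d;
}
```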


How the Computer Sets Overflow for Signed Add/Subtract 


Machines often set “overflow” for signed addition by means of
the logic “the carry into the sign position is not equal to the
carry out of the sign position.” Curiously, this logic gives the
correct overflow indication for both addition and subtraction,
assuming the subtraction x − y is done by x + ¬y + 1.
Furthermore, it is correct whether or not there is a carry- or 
borrow-in. This does not seem to lead to any particularly good 
methods for computing the signed overflow predicate in 
software, however, even though it is easy to compute the carry 
into the sign position. For addition and subtraction, the carry/ 
borrow into the sign position is given by the sign bit after 
evaluating the following expressions (where c is 0 or 1): 


carry                        borrow

(x + y + c) ⊕ x ⊕ y          (x − y − c) ⊕ x ⊕ y


In fact, these expressions give, at each position i, the carry/ 
borrow into position i. 


Unsigned Add/Subtract 


The following branch-free code can be used to compute the
overflow predicate for unsigned add/subtract, with the result
being in the sign position. The expressions involving a right shift
are probably useful only when it is known that c = 0. The
expressions in brackets compute the carry or borrow generated
from the least significant position. (Here >>u denotes the
unsigned, or logical, right shift.)


x + y + c, unsigned
(x & y) | ((x | y) & ¬(x + y + c))
(x >>u 1) + (y >>u 1) + [((x & y) | ((x | y) & c)) & 1]


x − y − c, unsigned
(¬x & y) | ((x ≡ y) & (x − y − c))
(¬x & y) | ((¬x | y) & (x − y − c))
(x >>u 1) − (y >>u 1) − [((¬x & y) | ((¬x | y) & c)) & 1]

For unsigned add’s and subtract’s, there are much simpler 
formulas in terms of comparisons [MIPS]. For unsigned addition, 
overflow (carry) occurs if the sum is less (by unsigned 
comparison) than either of the operands. This and similar 
formulas are given below. Unfortunately, there is no way in 
these formulas to allow for a variable c that represents the carry- 
or borrow-in. Instead, the program must test c, and use a 
different type of comparison depending upon whether c is 0 or 
1. 


x + y, unsigned    x + y + 1, unsigned    x − y, unsigned    x − y − 1, unsigned
¬x <u y            ¬x ≤u y                x <u y             x ≤u y
x + y <u x         x + y + 1 ≤u x         x − y >u x         x − y − 1 ≥u x

The first formula for each case above is evaluated before the 
add/subtract that may overflow, and it provides a way to do the 
test without causing overflow. The second formula for each case 
is evaluated after the add/subtract that may overflow. 
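
In C, where uint32_t arithmetic wraps and comparisons of unsigned operands are unsigned, these eight tests can be written almost verbatim. The function names below are made up for the illustration.

```c
#include <stdint.h>

/* Tests made before the operation; they cannot themselves overflow. */
int uadd_will_carry(uint32_t x, uint32_t y)   { return (uint32_t)~x <  y; } /* x + y     */
int uadd1_will_carry(uint32_t x, uint32_t y)  { return (uint32_t)~x <= y; } /* x + y + 1 */
int usub_will_borrow(uint32_t x, uint32_t y)  { return x <  y; }            /* x - y     */
int usub1_will_borrow(uint32_t x, uint32_t y) { return x <= y; }            /* x - y - 1 */

/* Tests made after the operation, on the wrapped result. */
int uadd_carried(uint32_t x, uint32_t y)   { return (uint32_t)(x + y)     <  x; }
int uadd1_carried(uint32_t x, uint32_t y)  { return (uint32_t)(x + y + 1) <= x; }
int usub_borrowed(uint32_t x, uint32_t y)  { return (uint32_t)(x - y)     >  x; }
int usub1_borrowed(uint32_t x, uint32_t y) { return (uint32_t)(x - y - 1) >= x; }
```

A caller that carries a variable c must pick between the plain and the "+1" (or "−1") variant according to the value of c, as the text notes.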

There does not seem to be a similar simple device (using 
comparisons) for computing the signed overflow predicate. 


Multiplication 


For multiplication, overflow means that the result cannot be 
expressed in 32 bits (it can always be expressed in 64 bits, 
whether signed or unsigned). Checking for overflow is simple if 
you have access to the high-order 32 bits of the product. Let us 
denote the two halves of the 64-bit product by hi(x × y) and
lo(x × y). Then the overflow predicates can be computed as
follows [MIPS] (>>s is the signed, or arithmetic, right shift):


x × y, unsigned          x × y, signed
hi(x × y) ≠ 0            hi(x × y) ≠ (lo(x × y) >>s 31)
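
A small C sketch of both predicates, assuming a 64-bit integer type is available to hold the full product. The function names are hypothetical, and the signed right shift by 31 is spelled 0 − (lo >> 31) on unsigned words so that the code does not depend on how the compiler shifts negative values.

```c
#include <stdint.h>

/* Unsigned: overflow iff the high-order half of the product is nonzero. */
int umul_overflows(uint32_t x, uint32_t y)
{
    uint64_t p = (uint64_t)x * y;
    return (uint32_t)(p >> 32) != 0;
}

/* Signed: overflow iff the high half is not 32 copies of the low
   half's sign bit. */
int smul_overflows(int32_t x, int32_t y)
{
    uint64_t p = (uint64_t)((int64_t)x * (int64_t)y);
    uint32_t hi = (uint32_t)(p >> 32), lo = (uint32_t)p;
    return hi != 0u - (lo >> 31);
}
```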


One way to check for overflow of multiplication is to do the 
multiplication and then check the result by dividing. Care must 
be taken not to divide by 0, and there is a further complication 
for signed multiplication. Overflow occurs if the following 
expressions are true: 


Unsigned                     Signed
z ← x * y                    z ← x * y
y ≠ 0 & z ÷u y ≠ x           (y < 0 & x = −2^31) | (y ≠ 0 & z ÷ y ≠ x)


The complication arises when x = −2^31 and y = −1. In this
case the multiplication overflows, but the machine may very
well give a result of −2^31. This causes the division to overflow,
and thus any result is possible (for some machines). Therefore,
this case has to be checked separately, which is done by the term
y < 0 & x = −2^31. The above expressions use the “conditional
and” operator to prevent dividing by 0 (in C, use the &&
operator). Here ÷u denotes unsigned division.


It is also possible to use division to check for overflow of
multiplication without doing the multiplication (that is, without
causing overflow). For unsigned integers, the product overflows
iff xy > 2^32 − 1, or x > ((2^32 − 1)/y), or, since x is an integer,
x > ⌊(2^32 − 1)/y⌋. Expressed in computer arithmetic, this is


y ≠ 0 & x >u (0xFFFFFFFF ÷u y).
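
The division-based checks described above might look as follows in C. The names are hypothetical. Because signed overflow is undefined in C, the wrapped signed product is formed in unsigned arithmetic and converted back, which is assumed to behave as on a two's-complement machine.

```c
#include <stdint.h>

/* Post-check, unsigned: multiply with wraparound, then divide back. */
int umul_overflowed(uint32_t x, uint32_t y)
{
    uint32_t z = x * y;
    return y != 0 && z / y != x;
}

/* Post-check, signed. The first term handles x = -2^31 with y < 0,
   so the division -2^31 / -1 is never attempted. */
int smul_overflowed(int32_t x, int32_t y)
{
    int32_t z = (int32_t)((uint32_t)x * (uint32_t)y);
    return (y < 0 && x == INT32_MIN) || (y != 0 && z / y != x);
}

/* Pre-check, unsigned: no multiplication is performed. */
int umul_will_overflow(uint32_t x, uint32_t y)
{
    return y != 0 && x > 0xFFFFFFFFu / y;
}
```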


For signed integers, the determination of overflow of x * y is
not so simple. If x and y have the same sign, then overflow
occurs iff xy > 2^31 − 1. If they have opposite signs, then
overflow occurs iff xy < −2^31. These conditions can be tested
as indicated in Table 2-2, which employs signed division. This
test is awkward to implement, because of the four cases. It is
difficult to unify the expressions very much because of problems
with overflow and with not being able to represent the number
+2^31.

The test can be simplified if unsigned division is available.
We can use the absolute values of x and y, which are correctly
represented under unsigned integer interpretation. The complete
test can then be computed as shown below. The variable c =
2^31 − 1 if x and y have the same sign, and c = 2^31 otherwise.


TABLE 2-2. OVERFLOW TEST FOR SIGNED MULTIPLICATION


         y > 0                     y ≤ 0
x > 0    x > 0x7FFFFFFF ÷ y        y < 0x80000000 ÷ x
x ≤ 0    x < 0x80000000 ÷ y        x ≠ 0 & y < 0x7FFFFFFF ÷ x


c ← ((x ≡ y) >>s 31) + 2^31
x ← abs(x)
y ← abs(y)
y ≠ 0 & x >u (c ÷u y)
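
A C sketch of this unified test follows. The function name is hypothetical, and c is formed as 0x7FFFFFFF plus the sign bit of x ⊕ y, which gives the same two values as the expression above without needing a signed shift.

```c
#include <stdint.h>

/* Nonzero iff the signed product x * y would overflow 32 bits.
   No multiplication is performed. */
int smul_will_overflow(int32_t x, int32_t y)
{
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    /* c = 2^31 - 1 if the signs agree, 2^31 if they differ. */
    uint32_t c = 0x7FFFFFFFu + ((ux ^ uy) >> 31);
    uint32_t ax = x < 0 ? 0u - ux : ux;     /* |x| as an unsigned value */
    uint32_t ay = y < 0 ? 0u - uy : uy;     /* |y| as an unsigned value */
    return ay != 0 && ax > c / ay;          /* unsigned compare and divide */
}
```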

The number of leading zeros instruction can be used to give an
estimate of whether or not x * y will overflow, and the estimate
can be refined to give an accurate determination. First, consider
the multiplication of unsigned numbers. It is easy to show that if
x and y, as 32-bit quantities, have m and n leading 0's,
respectively, then the 64-bit product has either m + n or m + n
+ 1 leading 0's (or 64, if either x = 0 or y = 0). Overflow
occurs if the 64-bit product has fewer than 32 leading 0's.
Hence,


nlz(x) + nlz(y) ≥ 32: Multiplication definitely does not
overflow.
nlz(x) + nlz(y) ≤ 30: Multiplication definitely does
overflow.


For nlz(x) + nlz(y) = 31, overflow may or may not occur. In
this case, the overflow assessment can be made by evaluating t
= x⌊y/2⌋. This will not overflow. Since xy is 2t or, if y is odd, 2t
+ x, the product xy overflows if t ≥ 2^31. These considerations
lead to a plan for computing xy, but branching to “overflow” if
the product overflows. This plan is shown in Figure 2-2.


For the multiplication of signed integers, we can make a
partial determination of whether or not overflow occurs from
the number of leading 0's of nonnegative arguments, and the
number of leading 1's of negative arguments. Let


m = nlz(x) + nlz(¬x), and
n = nlz(y) + nlz(¬y).


unsigned x, y, z, m, n, t;

m = nlz(x);
n = nlz(y);
if (m + n <= 30) goto overflow;
t = x*(y >> 1);
if ((int)t < 0) goto overflow;
z = t*2;
if (y & 1) {
   z = z + x;
   if (z < x) goto overflow;
}
// z is the correct product of x and y.


FIGURE 2-2. Determination of overflow of unsigned
multiplication.


Then, we have


m + n ≥ 34: Multiplication definitely does not overflow.
m + n ≤ 31: Multiplication definitely does overflow.


There are two ambiguous cases: 32 and 33. The case m + n
= 33 overflows only when both arguments are negative and the
true product is exactly 2^31 (machine result is −2^31), so it can be
recognized by a test that the product has the correct sign (that
is, overflow occurred if x ⊕ y ⊕ (x * y) < 0). When m + n =
32, the distinction is not so easily made.


We will not dwell on this further, except to note that an 
overflow estimate for signed multiplication can also be made 
based on nlz(abs(x)) + nlz(abs(y)) but again there are two 
ambiguous cases (a sum of 31 or 32). 


Division


For the signed division x ÷ y, overflow occurs if the following
expression is true:


y = 0 | (x = 0x80000000 & y = −1)


Most machines signal overflow (or trap) for the indeterminate
form 0 ÷ 0.


Straightforward code for evaluating this expression, including 
a final branch to the overflow handling code, consists of seven 
instructions, three of which are branches. There do not seem to 
be any particularly good tricks to improve on this, but here are a 
few possibilities: 


[abs(y ⊕ 0x80000000) | (abs(x) & abs(y ≡ 0x80000000))] < 0


That is, evaluate the large expression in brackets, and branch if 
the result is less than 0. This executes in about nine instructions, 
counting the load of the constant and the final branch, on a 
machine that has the indicated instructions and that gets the 
“compare to 0” for free. 


Some other possibilities are to first compute z from


z ← (x ⊕ 0x80000000) | (y + 1)


(three instructions on many machines), and then do the test and
branch on y = 0 | z = 0 in one of the following ways:


((y | −y) & (z | −z)) ≥ 0
(nabs(y) & nabs(z)) ≥ 0
((nlz(y) | nlz(z)) >>u 5) ≠ 0


These execute in nine, seven, and eight instructions,
respectively, on a machine that has the indicated instructions.
The last line represents a good method for PowerPC.
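
For comparison, here is the straightforward predicate next to the first of the branch-free forms, as a C sketch with hypothetical function names. The branch-free version works on unsigned words so that y + 1 and the negations wrap rather than overflow.

```c
#include <stdint.h>

/* Straightforward form of the signed-division overflow predicate. */
int sdiv_overflows(int32_t x, int32_t y)
{
    return y == 0 || (x == INT32_MIN && y == -1);
}

/* Branch-free form: z is 0 iff x = -2^31 and y = -1; then the sign
   bit of (y | -y) & (z | -z) is 1 iff both y and z are nonzero. */
int sdiv_overflows_bf(int32_t x, int32_t y)
{
    uint32_t uy = (uint32_t)y;
    uint32_t z = ((uint32_t)x ^ 0x80000000u) | (uy + 1);
    uint32_t t = (uy | (0u - uy)) & (z | (0u - z));
    return (t >> 31) == 0;      /* "t >= 0" as a signed comparison */
}
```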


For the unsigned division x ÷u y, overflow occurs if and only
if y = 0.

Some machines have a "long division" instruction (see page 
192) and you may want to predict, using elementary 
instructions, when it would overflow. We will discuss this in 
terms of an instruction that divides a doubleword by a fullword, 
producing a fullword quotient and possibly also a fullword 
remainder. 


Such an instruction overflows if either the divisor is 0 or if
the quotient cannot be represented in 32 bits. Typically, in these 
overflow cases both the quotient and remainder are incorrect. 
The remainder cannot overflow in the sense of being too large to 
represent in 32 bits (it is less than the divisor in magnitude), so 
the test that the remainder will be correct is the same as the test 
that the quotient will be correct. 


We assume the machine either has 64-bit general registers or 
32-bit registers and there is no problem doing elementary 
operations (shifts, adds, and so forth) on 64-bit quantities. For 
example, the compiler might implement a doubleword integer 
data type. 


In the unsigned case the test is trivial: for x ÷ y with x a
doubleword and y a fullword, the division will not overflow if
(and only if) either of the following equivalent expressions is
true.


y ≠ 0 & x <u (y << 32)
y ≠ 0 & (x >>u 32) <u y


On a 32-bit machine, the shifts need not be done; simply
compare y to the register that contains the high-order half of x.
To ensure correct results on a 64-bit machine, it is also necessary
to check that the divisor y is a 32-bit quantity (e.g., check that
(y >>u 32) = 0).

The signed case is more interesting. It is first necessary to
check that y ≠ 0 and, on a 64-bit machine, that y is correctly
represented in 32 bits (check that ((y << 32) >>s 32) = y).
Assuming these tests have been done, the table that follows
shows how the tests might be done to determine precisely
whether or not the quotient is representable in 32 bits by
considering separately the four cases of the dividend and divisor
each being positive or negative. The expressions in the table are
in ordinary arithmetic, not computer arithmetic.


In each column, each relation follows from the one above it 
in an if-and-only-if way. To remove the floor and ceiling 
functions, some relations from Theorem D1 on page 183 are 
used. 


x ≥ 0, y > 0       x ≥ 0, y < 0             x < 0, y > 0           x < 0, y < 0

⌊x/y⌋ < 2^31       ⌈x/y⌉ ≥ −2^31            ⌈x/y⌉ ≥ −2^31          ⌊x/y⌋ < 2^31
x/y < 2^31         ⌈x/y⌉ > −2^31 − 1        ⌈x/y⌉ > −2^31 − 1      x/y < 2^31
x < 2^31·y         x/y > −2^31 − 1          x/y > −2^31 − 1        x > 2^31·y
                   x < −2^31·y − y          x > −2^31·y − y        −x < 2^31·(−y)
                   x < 2^31·(−y) + (−y)     −x < 2^31·y + y


As an example of interpreting this table, consider the leftmost
column. It applies to the case in which x ≥ 0 and y > 0. In this
case the quotient is ⌊x/y⌋, and this must be strictly less than 2^31
to be representable as a 32-bit quantity. From this it follows that
the real number x/y must be less than 2^31, or x must be less
than 2^31·y. This test can be implemented by shifting y left 31
positions and comparing the result to x.


When the signs of x and y differ, the quotient of conventional
division is ⌈x/y⌉. Because the quotient is negative, it can be as
small as −2^31.


In the bottom row of each column the comparisons are all of 
the same type (less than). Because of the possibility that x is the 
maximum negative number, in the third and fourth columns an 
unsigned comparison must be used. In the first two columns the 
quantities being compared begin with a leading 0-bit, so an
unsigned comparison can be used there, too. 


These tests can, of course, be implemented by using 
conditional branches to separate out the four cases, doing the 
indicated arithmetic, and then doing a final compare and branch 
to the code for the overflow or non-overflow case. However, 
branching can be reduced by taking advantage of the fact that 
when y is negative, —y is used, and similarly for x. Hence the 
tests can be made more uniform by using the absolute values of 
x and y. Also, using a standard device for optionally doing the 
additions in the second and third columns results in the
following scheme:


x′ ← |x|
y′ ← |y|
δ ← ((x ⊕ y) >>s 63) & y′
if (x′ <u (y′ << 31) + δ) then {will not overflow}


Using the three-instruction method of computing the absolute 
value (see page 18), on a 64-bit version of the basic RISC this 
amounts to 12 instructions, plus a conditional branch. 
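
The scheme can be sketched in C for a 64-bit dividend and a 32-bit divisor. The function name is hypothetical; the absolute values are taken in unsigned arithmetic so that the maximum negative number is handled, and the mask ((x ⊕ y) >>s 63) is replaced by an explicit comparison of the two signs. The test for a zero divisor is folded in.

```c
#include <stdint.h>

/* Nonzero iff the signed quotient x / y (truncating) is representable
   in 32 bits. A zero divisor is reported as "does not fit". */
int sdivl_fits(int64_t x, int32_t y)
{
    uint64_t ux = (uint64_t)x;
    uint64_t xa = x < 0 ? 0 - ux : ux;                    /* x' = |x| */
    uint64_t ya = y < 0 ? 0 - (uint64_t)y : (uint64_t)y;  /* y' = |y| */
    uint64_t delta = ((x < 0) != (y < 0)) ? ya : 0;       /* y' if signs differ */
    return y != 0 && xa < (ya << 31) + delta;             /* unsigned compare */
}
```

Since y′ is at most 2^31, the quantity (y′ << 31) + δ stays well below 2^64, so the right-hand side cannot wrap.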


2-14 Condition Code Result of Add, Subtract, and Multiply


Many machines provide a “condition code" that characterizes 
the result of integer arithmetic operations. Often there is only 
one add instruction, and the characterization reflects the result 
for both unsigned and signed interpretation of the operands and 
result (but not for mixed types). The characterization usually 
consists of the following: 


* Whether or not carry occurred (unsigned overflow) 
* Whether or not signed overflow occurred 


* Whether the 32-bit result, interpreted as a signed two's- 
complement integer and ignoring carry and overflow, is 
negative, 0, or positive 


Some older machines give an indication of whether the 
infinite precision result (that is, 33-bit result for add's and 
subtract's) is positive, negative, or 0. However, this indication is 
not easily used by compilers of high-level languages, and so has 
fallen out of favor. 


For addition, only nine of the 12 combinations of these
events are possible. The ones that cannot occur are “no carry,
overflow, result > 0,” “no carry, overflow, result = 0,” and
“carry, overflow, result < 0.” Thus, four bits are, just barely,
needed for the condition code. Two of the combinations are
unique in the sense that only one value of inputs produces them:
Adding 0 to itself is the only way to get “no carry, no overflow,
result = 0,” and adding the maximum negative number to itself
is the only way to get “carry, overflow, result = 0.” These
remarks remain true if there is a “carry in,” that is, if we are
computing x + y + 1.


For subtraction, let us assume that to compute x − y the
machine actually computes x + ¬y + 1, with the carry produced
as for an add (in this scheme the meaning of “carry” is reversed
for subtraction, in that carry = 1 signifies that the result fits in a
single word, and carry = 0 signifies that the result does not fit
in a single word). Then for subtraction, only seven combinations
of events are possible. The ones that cannot occur are the three
that cannot occur for addition, plus “no carry, no overflow,
result = 0,” and “carry, overflow, result = 0.”


If a machine’s multiplier can produce a doubleword result, 
then two multiply instructions are desirable: one for signed and 
one for unsigned operands. (On a 4-bit machine, in hexadecimal,
F × F = 01 signed, and F × F = E1 unsigned.) For these
instructions, neither carry nor overflow can occur, in the sense 
that the result will always fit in a doubleword. 


For a multiplication instruction that produces a one-word 
result (the low-order word of the doubleword result), let us take 
“carry” to mean that the result does not fit in a word with the 
operands and result interpreted as unsigned integers, and let us 
take “overflow” to mean that the result does not fit in a word 
with the operands and result interpreted as signed two’s- 
complement integers. Then again, there are nine possible 
combinations of results, with the missing ones being “no carry,
overflow, result > 0,” “no carry, overflow, result = 0,” and
“carry, no overflow, result = 0.” Thus, considering addition,
subtraction, and multiplication together, ten combinations can 
occur. 


2-15 Rotate Shifts 


These are rather trivial. Perhaps surprisingly, this code works for 
n ranging from 0 to 32 inclusive, even if the shifts are mod-32. 


Rotate left n: y ← (x << n) | (x >>u (32 − n))


Rotate right n: y ← (x >>u n) | (x << (32 − n))
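
In C, shifting a 32-bit value by 32 or more positions is undefined, so a portable rendering has to reduce the shift amounts explicitly. The sketch below (hypothetical function names) masks both amounts with 31, which imitates mod-32 shift hardware and keeps the property that n = 0 and n = 32 both leave x unchanged.

```c
#include <stdint.h>

/* Rotate left by n; both shift amounts are reduced modulo 32. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << (n & 31)) | (x >> ((32 - n) & 31));
}

/* Rotate right by n. */
uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> (n & 31)) | (x << ((32 - n) & 31));
}
```

Many compilers recognize this idiom and emit a single rotate instruction where one exists, but that is a property of the compiler, not of the code.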


If your machine has double-length shifts, they can be used to 
do rotate shifts. These instructions might be written 



shldi RT,RA,RB,I 
shrdi RT,RA,RB,I 


They treat the concatenation of RA and RB as a single double- 
length quantity, and shift it left or right by the amount given by 
the immediate field I. (If the shift amount is in a register, the
instructions are awkward to implement on most RISCs because 
they require reading three registers.) The result of the left shift is 
the high-order word of the shifted double-length quantity, and 
the result of the right shift is the low-order word. 


Using shldi, a rotate left of Rx can be accomplished by


shldi RT,Rx,Rx,I


and similarly a rotate right shift can be accomplished with 
shrdi. 


A rotate left shift of one position can be accomplished by 
adding the contents of a register to itself with “end-around 
carry" (adding the carry that results from the addition to the 
sum in the low-order position). Most machines do not have that 
instruction, but on many machines it can be accomplished with 
two instructions: (1) add the contents of the register to itself, 
generating a carry (into a status register), and (2) add the carry 
to the sum. 


2-16 Double-Length Add/Subtract 


Using one of the expressions shown on page 31 for overflow of 
unsigned addition and subtraction, we can easily implement 
double-length addition and subtraction without accessing the 
machine's carry bit. To illustrate with double-length addition, let
the operands be (x1, x0) and (y1, y0), and the result be (z1, z0).
Subscript 1 denotes the most significant half, and subscript 0 the
least significant. We assume that all 32 bits of the registers are
used. The less significant words are unsigned quantities.


z0 ← x0 + y0
c ← [(x0 & y0) | ((x0 | y0) & ¬z0)] >>u 31
z1 ← x1 + y1 + c


This executes in nine instructions. The second line can be
c ← (z0 <u x0), permitting a four-instruction solution on
machines that have this comparison operator in a form that
gives the result as a 1 or 0 in a register, such as the “SLTU” (Set
on Less Than Unsigned) instruction on MIPS [MIPS].


Similar code for double-length subtraction (x − y) is


z0 ← x0 − y0
b ← [(¬x0 & y0) | ((x0 ≡ y0) & z0)] >>u 31
z1 ← x1 − y1 − b


This executes in eight instructions on a machine that has a full
set of logical instructions. The second line can be
b ← (x0 <u y0), permitting a four-instruction solution on
machines that have the “SLTU” instruction.

Double-length addition and subtraction can be done in five
instructions on most machines by representing the multiple-length
data using only 31 bits of the least significant words, with
the high-order bit being 0 except momentarily when it contains
a carry or borrow bit.
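
The comparison-based forms of double-length addition and subtraction map directly onto C, where an unsigned comparison yields 1 or 0. In the sketch below the pair32 type and the function names are hypothetical, and join exists only so that results can be checked against native 64-bit arithmetic.

```c
#include <stdint.h>

/* Hypothetical helper type: a 64-bit value held as two 32-bit words. */
typedef struct { uint32_t hi, lo; } pair32;

pair32 dadd(pair32 x, pair32 y)
{
    pair32 z;
    z.lo = x.lo + y.lo;
    z.hi = x.hi + y.hi + (z.lo < x.lo);   /* c = (z0 <u x0) */
    return z;
}

pair32 dsub(pair32 x, pair32 y)
{
    pair32 z;
    z.lo = x.lo - y.lo;
    z.hi = x.hi - y.hi - (x.lo < y.lo);   /* b = (x0 <u y0) */
    return z;
}

/* Reassemble the pair for comparison with native 64-bit arithmetic. */
uint64_t join(pair32 p) { return ((uint64_t)p.hi << 32) | p.lo; }
```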


2-17 Double-Length Shifts 


Let (x1, x0) be a pair of 32-bit words to be shifted left or right as
if they were a single 64-bit quantity, with x1 being the most
significant half. Let (y1, y0) be the result, interpreted similarly.
Assume the shift amount n is a variable ranging from 0 to 63.
Assume further that the machine's shift instructions are modulo
64 or greater. That is, a shift amount in the range 32 to 63 or
−32 to −1 results in an all-0 word, unless the shift is a signed right
shift, in which case the result is 32 sign bits from the word
shifted. (This code will not work on the Intel x86 machines,
which have mod-32 shifts.)


Under these assumptions, the shift left double operation can be
accomplished as follows (eight instructions):


y1 ← x1 << n | x0 >>u (32 − n) | x0 << (n − 32)
y0 ← x0 << n


The main connective in the first assignment must be or, not plus,
to give the correct result when n = 32. If it is known that 0 ≤ n
≤ 32, the last term of the first assignment can be omitted,
giving a six-instruction solution.


Similarly, a shift right double unsigned operation can be done
with


y0 ← x0 >>u n | x1 << (32 − n) | x1 >>u (n − 32)
y1 ← x1 >>u n
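
C's own shift operators are undefined for amounts of 32 or more on 32-bit operands, so the sketch below first models the assumed mod-64 shift instructions with two small helpers (shl and shr, hypothetical names) and then transcribes the two formulas. Results are returned as a single 64-bit value only to make checking easy.

```c
#include <stdint.h>

/* Model of shift instructions that take the amount modulo 64:
   amounts 32..63 (and "negative" amounts, reduced mod 64) give 0. */
static uint32_t shl(uint32_t x, uint32_t n) { n &= 63; return n < 32 ? x << n : 0; }
static uint32_t shr(uint32_t x, uint32_t n) { n &= 63; return n < 32 ? x >> n : 0; }

/* Shift (x1, x0) left by n, 0 <= n <= 63. */
uint64_t shl_double(uint32_t x1, uint32_t x0, uint32_t n)
{
    uint32_t y1 = shl(x1, n) | shr(x0, 32 - n) | shl(x0, n - 32);
    uint32_t y0 = shl(x0, n);
    return ((uint64_t)y1 << 32) | y0;
}

/* Shift (x1, x0) right unsigned by n, 0 <= n <= 63. */
uint64_t shr_double_u(uint32_t x1, uint32_t x0, uint32_t n)
{
    uint32_t y0 = shr(x0, n) | shl(x1, 32 - n) | shr(x1, n - 32);
    uint32_t y1 = shr(x1, n);
    return ((uint64_t)y1 << 32) | y0;
}
```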


Shift right double signed is more difficult, because of an
unwanted sign propagation in one of the terms. Straightforward
code follows:


if n < 32 then y0 ← x0 >>u n | x1 << (32 − n)
else y0 ← x1 >>s (n − 32)
y1 ← x1 >>s n

If your machine has the conditional move instructions, it is a 
simple matter to express this in branch-free code, in which form 
it takes eight instructions. If the conditional move instructions 
are not available, the operation can be done in ten instructions 
by using the familiar device of constructing a mask with the shift 
right signed 31 instruction to mask the unwanted sign 
propagating term: 


y0 ← x0 >>u n | x1 << (32 − n) | [(x1 >>s (n − 32)) & ((32 − n) >>s 31)]
y1 ← x1 >>s n


2-18 Multibyte Add, Subtract, Absolute Value


Some applications deal with arrays of short integers (usually
bytes or halfwords), and often execution is faster if they are
operated on a word at a time. For definiteness, the examples
here deal with the case of four 1-byte integers packed into a
word, but the techniques are easily adapted to other packings,
such as a word containing a 12-bit integer and two 10-bit
integers, and so on. These techniques are of greater value on
64-bit machines, because more work is done in parallel.


Addition must be done in a way that blocks the carries from 
one byte into another. This can be accomplished by the 
following two-step method: 

1. Mask out the high-order bit of each byte of each operand 
and add (there will then be no carries across byte 
boundaries). 

2. Fix up the high-order bit of each byte with a 1-bit add of 
the two operands and the carry into that bit. 

The carry into the high-order bit of each byte is given by the
high-order bit of each byte of the sum computed in step 1. The
subsequent similar method works for subtraction:


Addition
s ← (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F)
s ← ((x ⊕ y) & 0x80808080) ⊕ s


Subtraction
d ← (x | 0x80808080) − (y & 0x7F7F7F7F)
d ← ((x ⊕ y) | 0x7F7F7F7F) ≡ d


These execute in eight instructions, counting the load of
0x7F7F7F7F, on a machine that has a full set of logical
instructions. (Change the and and or of 0x80808080 to and not
and or not, respectively, of 0x7F7F7F7F.)
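
A direct C transcription of the four-bytes-in-a-word case, with hypothetical function names. Each byte of the result is the sum or difference of the corresponding bytes modulo 256; the equivalence ≡ is written as the complement of exclusive or.

```c
#include <stdint.h>

/* Add four packed bytes independently (each modulo 256). */
uint32_t add_bytes(uint32_t x, uint32_t y)
{
    uint32_t s = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);  /* low 7 bits, no cross-byte carry */
    return ((x ^ y) & 0x80808080u) ^ s;                  /* fix up the high bit of each byte */
}

/* Subtract four packed bytes independently (each modulo 256). */
uint32_t sub_bytes(uint32_t x, uint32_t y)
{
    uint32_t d = (x | 0x80808080u) - (y & 0x7F7F7F7Fu);  /* no cross-byte borrow */
    return ~(((x ^ y) | 0x7F7F7F7Fu) ^ d);
}
```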


There is a different technique for the case in which the word
is divided into only two fields. In this case, addition can be done
by means of a 32-bit addition followed by subtracting out the
unwanted carry. On page 30 we noted that the expression (x +
y) ⊕ x ⊕ y gives the carries into each position. Using this and
similar observations about subtraction gives the following code
for adding/subtracting two halfwords modulo 2^16 (seven
instructions):


Addition                          Subtraction
s ← x + y                         d ← x − y
c ← (s ⊕ x ⊕ y) & 0x00010000      b ← (d ⊕ x ⊕ y) & 0x00010000
s ← s − c                         d ← d + b


Multibyte absolute value is easily done by complementing and 
adding 1 to each byte that contains a negative integer (that is, 
has its high-order bit on). The following code sets each byte of y 
equal to the absolute value of each byte of x (eight instructions): 


a ← x & 0x80808080       // Isolate signs.
b ← a >>u 7              // Integer 1 where x is negative.
m ← (a − b) | a          // 0xFF where x is negative.
y ← (x ⊕ m) + b          // Complement and add 1 where negative.


The third line could as well be m ← a + a − b. The addition of
b in the fourth line cannot carry across byte boundaries, because
the quantity x ⊕ m has a high-order 0 in each byte.
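
The four steps translate line for line into C (the function name is hypothetical). As with the full-word absolute value, a byte holding −128 (0x80) is returned unchanged.

```c
#include <stdint.h>

/* Absolute value of each of the four signed bytes packed in x. */
uint32_t abs_bytes(uint32_t x)
{
    uint32_t a = x & 0x80808080u;   /* isolate signs */
    uint32_t b = a >> 7;            /* integer 1 where the byte is negative */
    uint32_t m = (a - b) | a;       /* 0xFF where the byte is negative */
    return (x ^ m) + b;             /* complement and add 1 where negative */
}
```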


2-19 Doz, Max, Min 


The “doz” function is “difference or zero,” defined as follows: 


Signed                              Unsigned

doz(x, y) = x − y,  if x ≥ y,       dozu(x, y) = x − y,  if x ≥u y,
            0,      if x < y.                    0,      if x <u y.

It has been called “first grade subtraction” because the result is 0 
if you try to take away too much.³ If implemented as a computer
instruction, perhaps its most important use is to implement the 
max(x, y) and min(x, y) functions (in both signed and unsigned 
forms) in just two simple instructions, as will be seen. 
Implementing max(x, y) and min(x, y) in hardware is difficult 
because the machine would need paths from the output ports of 
the register file back to an input port, bypassing the adder. 
These paths are not normally present. If supplied, they would be 
in a region that's often crowded with wiring for register 
bypasses. The situation is illustrated in Figure 2-3. The adder is 
used (by the instruction) to do the subtraction x − y. The high-order
bits of the result of the subtraction (sign bit and carries, as
described on page 27) define whether x ≥ y or x < y. The
comparison result is fed to a multiplexor (MUX) that selects 
either x or y as the result to write into the target register. These 
paths, from register file outputs x and y to the multiplexor, are 
not normally present and would have little use. The difference or 
zero instructions can be implemented without these paths 
because it is the output of the adder (or 0) that is fed back to the 
register file. 


FIGURE 2-3. Implementing max(x, y) and min(x, y).


Using difference or zero, max(x, y) and min(x, y) can be 
implemented in two instructions as follows: 


Signed                           Unsigned
max(x, y) = y + doz(x, y)        maxu(x, y) = y + dozu(x, y)
min(x, y) = x − doz(x, y)        minu(x, y) = x − dozu(x, y)


In the signed case, the result of the difference or zero 
instruction can be negative. This happens if overflow occurs in 
the subtraction. Overflow should be ignored; the addition of y or 
subtraction from x will overflow again, and the result will be 
correct. When doz(x, y) is negative, it is actually the correct 
difference if it is interpreted as an unsigned integer. 


Suppose your computer does not have the difference or zero 
instructions, but you want to code doz(x, y), max(x, y), and so 
forth, in an efficient branch-free way. In the next few paragraphs 
we show how these functions might be coded if your machine 
has the conditional move instructions, comparison predicates, 
efficient access to the carry bit, or none of these. 


If your machine has the conditional move instructions, it can
get doz(x, y) in three instructions, and destructive⁴ max(x, y)
and min(x, y) in two instructions. For example, on the full RISC,
z ← doz(x, y) can be calculated as follows (r0 is a permanent
zero register):


sub   z,x,y     Set z = x − y.
cmplt t,x,y     Set t = 1 if x < y, else 0.
movne z,t,r0    Set z = 0 if x < y.


Also on the full RISC, x ← max(x, y) can be calculated as
follows:


cmplt t,x,y     Set t = 1 if x < y, else 0.
movne x,t,y     Set x = y if x < y.

The min function, and the unsigned counterparts, are obtained 
by changing the comparison conditions. 


These functions can be computed in four or five instructions
using comparison predicates (three or four if the comparison
predicates give a result of −1 for “true”):


doz(x, y) = (x − y) & −(x ≥ y)
max(x, y) = y + doz(x, y)
          = ((x ⊕ y) & −(x ≥ y)) ⊕ y
min(x, y) = x − doz(x, y)
          = ((x ⊕ y) & −(x ≤ y)) ⊕ y
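
C comparison operators yield 1 or 0, so the predicate forms carry over directly. In the sketch below (hypothetical names) the subtraction and the mask are computed on unsigned words, because x − y may overflow as a signed quantity; doz returns its result as an unsigned integer, and the conversions back to int32_t in max32 and min32 are assumed to be the usual two's-complement ones.

```c
#include <stdint.h>

/* Difference or zero, signed comparison, unsigned result. */
uint32_t doz(int32_t x, int32_t y)
{
    return ((uint32_t)x - (uint32_t)y) & (0u - (uint32_t)(x >= y));
}

int32_t max32(int32_t x, int32_t y)
{
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    return (int32_t)(((ux ^ uy) & (0u - (uint32_t)(x >= y))) ^ uy);
}

int32_t min32(int32_t x, int32_t y)
{
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    return (int32_t)(((ux ^ uy) & (0u - (uint32_t)(x <= y))) ^ uy);
}
```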


On some machines, the carry bit may be a useful aid to
computing the unsigned versions of these functions. Let carry(x
− y) denote the bit that comes out of the adder for the
operation x + ¬y + 1, moved to a GPR. Thus, carry(x − y) = 1
iff x ≥ y. Then we have


dozu(x, y) = ((x − y) & ¬(carry(x − y) − 1))
maxu(x, y) = x − ((x − y) & (carry(x − y) − 1))
minu(x, y) = y + ((x − y) & (carry(x − y) − 1))


On most machines that have a subtract that generates a carry 
or borrow, and another form of subtract that uses that carry or 
borrow as an input, the expression carry(x — y) — 1 can be 
computed in one more instruction after the subtraction of y from 
x. For example, on the Intel x86 machines, minu(x, y) can be 
computed in four instructions as follows: 


sub eax,ecx     ; Inputs x and y are in eax and ecx resp.
sbb edx,edx     ; edx = 0 if x >= y, else -1.
and eax,edx     ; 0 if x >= y, else x - y.
add eax,ecx     ; Add y, giving y if x >= y, else x.


In this way, all three of the functions can be computed in four
instructions (three instructions for dozu(x, y) if the machine has
and with complement).


A method that applies to nearly any RISC is to use one of the 
above expressions that employ a comparison predicate, and to 
substitute for the predicate one of the expressions given on page 
23. For example: 


d ← x − y
doz(x, y) = d & [(d ≡ ((x ⊕ y) & (d ⊕ x))) >>s 31]
dozu(x, y) = d & ¬[((¬x & y) | ((x ≡ y) & d)) >>s 31]


These require from seven to ten instructions, depending on the 
computer's instruction set, plus one more to get max or min. 


These operations can be done in four branch-free basic RISC
instructions if it is known that −2^31 ≤ x − y ≤ 2^31 − 1 (that
is an expression in ordinary arithmetic, not computer
arithmetic). The same code works for both signed and unsigned
integers, with the same restriction on x and y. A sufficient
condition for these formulas to be valid is that, for signed
integers, −2^30 ≤ x, y ≤ 2^30 − 1, and for unsigned integers, 0
≤ x, y ≤ 2^31 − 1.


doz(x, y) = dozu(x, y) = (x − y) & ¬((x − y) >>s 31)
max(x, y) = maxu(x, y) = x − ((x − y) & ((x − y) >>s 31))
min(x, y) = minu(x, y) = y + ((x − y) & ((x − y) >>s 31))


Some uses of the difference or zero instruction are given here.
In these, the result of doz(x, y) must be interpreted as an
unsigned integer.


1. It directly implements the Fortran IDIM function. 
2. To compute the absolute value of a difference [Knu7]: 

|x − y| = doz(x, y) + doz(y, x), signed arguments, 
        = dozu(x, y) + dozu(y, x), unsigned arguments. 

Corollary: |x| = doz(x, 0) + doz(0, x) (other 
three-instruction solutions are given on page 18). 

3. To clamp the upper limit of the true sum of unsigned 
integers x and y to the maximum positive number 
(2^32 − 1) [Knu7]: 

¬dozu(¬x, y). 

4. Some comparison predicates (four instructions each): 

x > y = (doz(x, y) | −doz(x, y)) >>u 31, 
x >u y = (dozu(x, y) | −dozu(x, y)) >>u 31. 

5. The carry bit from the addition x + y (five instructions): 

carry(x + y) = x >u ¬y = (dozu(x, ¬y) | −dozu(x, ¬y)) >>u 31. 

The expression doz(x, −y), with the result interpreted as an 
unsigned integer, is in most cases the true sum x + y with the 
lower limit clamped at 0. However, it fails if y is the maximum 
negative number. 
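
Uses 2, 3, and 5 from the list above can be checked with a small C sketch. Here dozu is written with a comparison for clarity, since most machines have no such instruction, and the helper names absdiff_u, add_sat_u, and carry_u are invented for this example.

```c
#include <stdint.h>

/* Unsigned difference or zero: x - y if x >= y, else 0. */
uint32_t dozu(uint32_t x, uint32_t y) { return x >= y ? x - y : 0; }

/* Use 2: |x - y| for unsigned arguments. */
uint32_t absdiff_u(uint32_t x, uint32_t y) {
    return dozu(x, y) + dozu(y, x);
}

/* Use 3: true sum x + y clamped at 2^32 - 1. */
uint32_t add_sat_u(uint32_t x, uint32_t y) {
    return ~dozu(~x, y);
}

/* Use 5: carry out of the 32-bit addition x + y (0 or 1). */
uint32_t carry_u(uint32_t x, uint32_t y) {
    uint32_t d = dozu(x, ~y);      /* nonzero exactly when x > ~y */
    return (d | (0u - d)) >> 31;
}
```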


The IBM RS/6000 computer, and its predecessor the 801, 
have the signed version of difference or zero. Knuth's MMIX 
computer [Knu7] has the unsigned version (including some 
varieties that operate on parts of words in parallel). This raises 
the question of how to get the signed version from the unsigned 
version, and vice versa. This can be done as follows (where the 
additions and subtractions simply complement the sign bit): 


doz(x, y) = dozu(x + 2^31, y + 2^31), 
dozu(x, y) = doz(x − 2^31, y − 2^31). 


Some other identities that may be useful are: 


doz(¬x, ¬y) = doz(y, x), 
dozu(¬x, ¬y) = dozu(y, x). 


The relation doz(−x, −y) = doz(y, x) fails if either x or y, but 
not both, is the maximum negative number. 


2-20 Exchanging Registers 


A very old trick is exchanging the contents of two registers 
without using a third [IBM]: 


x ← x ⊕ y 
y ← y ⊕ x 
x ← x ⊕ y 


This works well on a two-address machine. The trick also 
works if ⊕ is replaced by the ≡ logical operation (complement 
of exclusive or) and can be made to work in various ways with 
add's and subtract's: 


x ← x + y      x ← x − y      x ← y − x 
y ← x − y      y ← y + x      y ← y − x 
x ← x − y      x ← y − x      x ← x + y 


Unfortunately, each of these has an instruction that is unsuitable 
for a two-address machine, unless the machine has “reverse 
subtract.” 
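
In C, the exclusive or exchange and the first add/subtract variant might look like the sketch below. The function names are illustrative only. One caveat that register code does not face: if both pointers refer to the same object, the three-step sequence would zero it, so that case is skipped.

```c
#include <stdint.h>

/* Exchange *x and *y with three exclusive or operations. */
void xor_swap(uint32_t *x, uint32_t *y) {
    if (x == y) return;   /* same object: nothing to exchange */
    *x ^= *y;
    *y ^= *x;
    *x ^= *y;
}

/* First add/subtract column; unsigned wraparound makes any
   overflow in x + y harmless. */
void addsub_swap(uint32_t *x, uint32_t *y) {
    if (x == y) return;
    *x = *x + *y;
    *y = *x - *y;
    *x = *x - *y;
}
```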


This little trick can actually be useful in the application of 
double buffering, in which two pointers are swapped. The first 
instruction can be factored out of the loop in which the swap is 
done (although this negates the advantage of saving a register): 


Outside the loop:   t ← x ⊕ y 

Inside the loop:    x ← x ⊕ t 
                    y ← y ⊕ t 


Exchanging Corresponding Fields of Registers 


The problem here is to exchange the contents of two registers x 
and y wherever a mask bit m_i = 1, and to leave x and y 
unaltered wherever m_i = 0. By “corresponding” fields, we mean 
that no shifting is required. The 1-bits of m need not be 
contiguous. The straightforward method is as follows: 


x′ ← (x & ¬m) | (y & m) 
y ← (y & ¬m) | (x & m) 
x ← x′ 

By using “temporaries” for the four and expressions, this can be 
seen to require seven instructions, assuming that either m or ¬m 
can be loaded with a single instruction and the machine has and 
not as a single instruction. If the machine is capable of executing 
the four (independent) and expressions in parallel, the execution 
time is only three cycles. 


A method that is probably better (five instructions, but four 
cycles on a machine with unlimited instruction-level parallelism) 
is shown in column (a) below. It is suggested by the “three 
exclusive or” code for exchanging registers. 


(a)                  (b)                   (c) 
x ← x ⊕ y            x ← x ≡ y             t ← (x ⊕ y) & m 
y ← y ⊕ (x & m)      y ← y ≡ (x | ¬m)      x ← x ⊕ t 
x ← x ⊕ y            x ← x ≡ y             y ← y ⊕ t 


The steps in column (b) do the same exchange as that of column 
(a), but column (b) is useful if m does not fit in an immediate 
field, but ¬m does, and the machine has the equivalence 
instruction. 


Still another method is shown in column (c) above [GLS1]. It 
also takes five instructions (again assuming one instruction must 
be used to load m into a register), but executes in only three 
cycles on a machine with sufficient instruction-level parallelism. 
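
As a rough C rendering of column (c), assuming 32-bit words and a hypothetical function name swap_fields:

```c
#include <stdint.h>

/* Exchange the bits of *x and *y selected by the 1-bits of m;
   bits where m is 0 are left alone. */
void swap_fields(uint32_t *x, uint32_t *y, uint32_t m) {
    uint32_t t = (*x ^ *y) & m;   /* differing bits inside the mask */
    *x ^= t;
    *y ^= t;
}
```

For example, with x = 0x12345678, y = 0xABCDEF01, and m = 0x00FF00F0, the call leaves x = 0x12CD5608 and y = 0xAB34EF71.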


Exchanging Two Fields of the Same Register 


Assume a register x has two fields (of the same length) that are 
to be swapped, without altering other bits in the register. That 
is, the object is to swap fields B and D without altering fields A, 
C, and E, in the computer word illustrated below. The fields are 
separated by a shift distance k. 


[Figure: a word divided, left to right, into fields A, B, C, D, 
and E, with fields B and D a distance k apart.] 


Straightforward code would shift D and B to their new 
positions, and combine the words with and and or operations, as 
follows: 


t1 ← (x & m) << k 
t2 ← (x >>u k) & m 
x ← (x & m′) | t1 | t2 


Here, m is a mask with 1’s in field D (and 0’s elsewhere), and m’ 
is a mask with 1’s in fields A, C, and E. This code requires 11 
instructions and six cycles on a machine with unlimited 
instruction-level parallelism, allowing for four instructions to 
generate the two masks. 


A method that requires only eight instructions and executes 
in five cycles, under the same assumptions, is shown below 
[GLS1]. It is similar to the code in column (c) on page 46 for 
interchanging corresponding fields of two registers. Again, m is a 
mask that isolates field D. 


t1 ← [x ⊕ (x >>u k)] & m 
t2 ← t1 << k 
x ← x ⊕ t1 ⊕ t2 

The idea is that t1 contains B ⊕ D in position D (and 0’s 
elsewhere), and t2 contains B ⊕ D in position B. This code, and 
the straightforward code given earlier, work correctly if B and D 
are “split fields,” that is, if the 1-bits of mask m are not 
contiguous. 
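
The eight-instruction method translates almost directly into C. The sketch below assumes a 32-bit word, 0 < k < 32, and a mask m that selects field D; the function name is made up for the example.

```c
#include <stdint.h>

/* Swap field D (selected by m) with field B, which lies k bits to
   the left of D. All other bits of x are returned unchanged. */
uint32_t swap_two_fields(uint32_t x, uint32_t m, unsigned k) {
    uint32_t t1 = (x ^ (x >> k)) & m;   /* B xor D, in position D */
    uint32_t t2 = t1 << k;              /* B xor D, in position B */
    return x ^ t1 ^ t2;
}
```

For instance, swap_two_fields(0x12345678, 0x000000F0, 16) exchanges the hexadecimal digits 3 and 7, giving 0x12745638.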


Conditional Exchange 


The exchange methods of the preceding two sections, which are 
based on exclusive or, degenerate into no-operations if the mask 
m is 0. Hence, they can perform an exchange of entire registers, 
or of corresponding fields of two registers, or of two fields of the 
same register, if m is set to all 1’s if some condition c is true, 
and to all 0’s if c is false. This gives branch-free code if m can be 
set up without branching. 
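
A minimal sketch of a conditional exchange of two whole registers, under the assumption that the condition arrives as a C truth value. The name cond_swap is hypothetical, and whether the compiler really emits branch-free code for the mask setup depends on the target.

```c
#include <stdint.h>

/* Exchange *x and *y if c is nonzero; otherwise leave them alone. */
void cond_swap(uint32_t *x, uint32_t *y, int c) {
    uint32_t m = 0u - (uint32_t)(c != 0);   /* all 1's or all 0's */
    uint32_t t = (*x ^ *y) & m;
    *x ^= t;
    *y ^= t;
}
```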


2-21 Alternating among Two or More Values 


Suppose a variable x can have only two possible values a and b, 
and you wish to assign to x the value other than its current one, 
and you wish your code to be independent of the values of a and 
b. For example, in a compiler x might be an opcode that is 
known to be either branch true or branch false, and whichever it 
is, you want to switch it to the other. The values of the opcodes 
branch true and branch false are arbitrary, probably defined by a 
C #define or enum declaration in a header file. 


The straightforward code to do the switch is 


if (x == a) x = b; 
else x = a; 


or, as is often seen in C programs, 

x = x == a ? b : a; 


A far better (or at least more efficient) way to code it is either 


x ← a + b − x, or 
x ← a ⊕ b ⊕ x. 


If a and b are constants, these require only one or two basic 
RISC instructions. Of course, overflow in calculating a + b can 
be ignored. 
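
Both forms are one-liners in C. In this sketch a and b are passed as parameters only for illustration (in practice they would be compile-time constants), and the function names are invented.

```c
/* Switch x to the other of the two values a and b.
   x must currently equal a or b. */
unsigned toggle_add(unsigned x, unsigned a, unsigned b) {
    return a + b - x;   /* wraparound in a + b is harmless */
}

unsigned toggle_xor(unsigned x, unsigned a, unsigned b) {
    return a ^ b ^ x;
}
```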


This raises the question: Is there some particularly efficient 
way to cycle among three or more values? That is, given three 
arbitrary but distinct constants a, b, and c, we seek an 
easy-to-evaluate function f that satisfies 


f(a) = b, 
f(b) = c, and 
f(c) = a. 


It is perhaps interesting to note that there is always a 
polynomial for such a function. For the case of three constants, 


f(x) = [(x − a)(x − b)] / [(c − a)(c − b)] · a 
     + [(x − b)(x − c)] / [(a − b)(a − c)] · b 
     + [(x − c)(x − a)] / [(b − c)(b − a)] · c.          (5) 

(The idea is that if x = a, the first and last terms vanish, and the 
middle term simplifies to b, and so on.) This requires 14 
arithmetic operations to evaluate, and for arbitrary a, b, and c, 
the intermediate results exceed the computer's word size. But it 
is just a quadratic; if written in the usual form for a polynomial 
and evaluated using Horner's rule,5 it would require only five 
arithmetic operations (four for a quadratic with integer 
coefficients, plus one for a final division). Rearranging Equation 
(5) accordingly gives 


f(x) = 1 / [(a − b)(a − c)(b − c)] · 
       {[(a − b)a + (b − c)b + (c − a)c]x² 
      + [(a − b)b² + (b − c)c² + (c − a)a²]x 
      + [(a − b)a²b + (b − c)b²c + (c − a)ac²]}. 

This is getting too complicated to be interesting, or practical. 

Another method, similar to Equation (5) in that just one of 
the three terms survives, is 


f(x) = ((−(x = c)) & a) + ((−(x = a)) & b) + ((−(x = b)) & c). 


This takes 11 instructions if the machine has the equal predicate, 
not counting loads of constants. Because the two addition 
operations are combining two 0 values with a nonzero, they can 
be replaced with or or exclusive or operations. 


The formula can be simplified by precalculating a − c and 
b − c, and then using [GLS1]: 


f(x) = ((−(x = c)) & (a − c)) + ((−(x = a)) & (b − c)) + c, or 

f(x) = ((−(x = c)) & (a ⊕ c)) ⊕ ((−(x = a)) & (b ⊕ c)) ⊕ c. 


Each of these operations takes eight instructions, but on most 
machines these are probably no better than the straightforward 
C code shown below, which executes in four to six instructions 
for small a, b, and c. 


if (x == a) x = b; 
else if (x == b) x = c; 
else x = a; 


Pursuing this matter, there is an ingenious branch-free 
method of cycling among three values on machines that do not 
have comparison predicate instructions [GLS1]. It executes in 
eight instructions on most machines. 


Because a, b, and c are distinct, there are two bit positions, 
n1 and n2, where the bits of a, b, and c are not all the same, and 
where the “odd one out” (the one whose bit differs in that 
position from the other two) is different in positions n1 and n2. 
This is illustrated below for the values 21, 31, and 20, shown in 
binary. 


10101   c 
11111   a 
10100   b 
[Figure: markers n1 and n2 beneath two of the bit columns.] 


Without loss of generality, rename a, b, and c so that a has 
the odd one out in position n1 and b has the odd one out in 
position n2, as shown above. Then there are two possibilities for 
the values of the bits at position n1, namely (a_n1, b_n1, c_n1) = 
(0, 1, 1) or (1, 0, 0). Similarly, there are two possibilities for the 
bits at position n2, namely (a_n2, b_n2, c_n2) = (0, 1, 0) or 
(1, 0, 1). This makes four cases in all, and formulas for each of 
these cases are shown below. 


Case 1. (a_n1, b_n1, c_n1) = (0, 1, 1), (a_n2, b_n2, c_n2) = (0, 1, 0): 

f(x) = x_n1 * (a − b) + x_n2 * (c − a) + b 

Case 2. (a_n1, b_n1, c_n1) = (0, 1, 1), (a_n2, b_n2, c_n2) = (1, 0, 1): 

f(x) = x_n1 * (a − b) + x_n2 * (a − c) + (b + c − a) 

Case 3. (a_n1, b_n1, c_n1) = (1, 0, 0), (a_n2, b_n2, c_n2) = (0, 1, 0): 

f(x) = x_n1 * (b − a) + x_n2 * (c − a) + a 

Case 4. (a_n1, b_n1, c_n1) = (1, 0, 0), (a_n2, b_n2, c_n2) = (1, 0, 1): 

f(x) = x_n1 * (b − a) + x_n2 * (a − c) + c 


In these formulas, the left operand of each multiplication is a 
single bit. A multiplication by 0 or 1 can be converted into an 
and with a value of 0 or all 1’s. Thus, the formulas can be 
rewritten as illustrated below for the first formula. 


f(x) = (((x << (31 − n1)) >>s 31) & (a − b)) 
     + (((x << (31 − n2)) >>s 31) & (c − a)) + b 


Because all variables except x are constants, this can be 
evaluated in eight instructions on the basic RISC. Here again, the 
additions and subtractions can be replaced with exclusive or. 
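
To make this concrete, here is a sketch for the values in the illustration, taking a = 31, b = 20, and c = 21. Choosing n1 = 1 and n2 = 0 (one valid choice; bit 3 would also serve as n1) puts these constants in Case 4. The function name cycle3 is invented, and the all-1's masks are formed by negating an isolated bit rather than by the shift pair in the text, which avoids relying on a signed right shift in C.

```c
#include <stdint.h>

/* Cycle 31 -> 20 -> 21 -> 31 without branches (Case 4):
   f(x) = x_n1 * (b - a) + x_n2 * (a - c) + c. */
uint32_t cycle3(uint32_t x) {
    enum { N1 = 1, N2 = 0 };
    const uint32_t a = 31, b = 20, c = 21;
    uint32_t m1 = 0u - ((x >> N1) & 1u);   /* 0 or all 1's from bit n1 */
    uint32_t m2 = 0u - ((x >> N2) & 1u);   /* 0 or all 1's from bit n2 */
    return (m1 & (b - a)) + (m2 & (a - c)) + c;
}
```

The subtraction b − a wraps around as an unsigned value, but the final sum is still correct modulo 2^32.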


This idea can be extended to cycling among four or more 
constants. The essence of the idea is to find bit positions n1, n2, 
..., at which the bits uniquely identify the constants. For four 
constants, three bit positions always suffice. Then (for four 
constants) solve the following equation for s, t, u, and v (that is, 
solve the system of four linear equations in which f(x) is a, b, c, 
or d, and the coefficients x_ni are 0 or 1): 


f(x) = x_n1 s + x_n2 t + x_n3 u + v 


If the four constants are uniquely identified by only two bit 
positions, the equation to solve is 


f(x) = x_n1 s + x_n2 t + x_n1 x_n2 u + v. 


2-22 A Boolean Decomposition Formula 


In this section, we have a look at the minimum number of 
binary Boolean operations, or instructions, that suffice to 
implement any Boolean function of three, four, or five variables. 
By a "Boolean function" we mean a Boolean-valued function of 
Boolean arguments. 


Our notation for Boolean algebra uses “+” for or, 
juxtaposition for and, ⊕ for exclusive or, and either an overbar or 
a prefix ¬ for not. These operators can be applied to single-bit 
operands or “bitwise” to computer words. Our main result is the 
following theorem: 


THEOREM. If f(x, y, z) is a Boolean function of three variables, 
then it can be decomposed into the form g(x, y) ⊕ zh(x, y), where g 
and h are Boolean functions of two variables.6 


Proof [Ditlow]. f(x, y, z) can be expressed as a sum of 
minterms, and then z̄ and z can be factored out of their terms, 
giving 


f(x, y, z) = z̄ f0(x, y) + z f1(x, y). 

Because the operands to “+” cannot both be 1, the or can be 
replaced with exclusive or, giving 


f(x, y, z) = z̄ f0(x, y) ⊕ z f1(x, y) 
           = (1 ⊕ z) f0(x, y) ⊕ z f1(x, y) 
           = f0(x, y) ⊕ z f0(x, y) ⊕ z f1(x, y) 
           = f0(x, y) ⊕ z(f0(x, y) ⊕ f1(x, y)), 


where we have twice used the identity (a ⊕ b)c = ac ⊕ bc. 


This is in the required form with g(x, y) = f0(x, y) and h(x, y) 
= f0(x, y) ⊕ f1(x, y). f0(x, y), incidentally, is f(x, y, z) with z = 
0, and f1(x, y) is f(x, y, z) with z = 1. 

COROLLARY. If a computer's instruction set includes an instruction 
for each of the 16 Boolean functions of two variables, then any 
Boolean function of three variables can be implemented with four (or 
fewer) instructions. 


One instruction implements g(x, y), another implements h(x, y), 
and these are combined with and and exclusive or. 


As an example, consider the Boolean function that is 1 if 
exactly two of x, y, and z are 1: 


f(x, y, z) = xyz̄ + xȳz + x̄yz. 

Before proceeding, the interested reader might like to try to 
implement f with four instructions, without using the theorem. 


From the proof of the theorem, 


f(x, y, z) = f0(x, y) ⊕ z(f0(x, y) ⊕ f1(x, y)) 
           = xy ⊕ z(xy ⊕ (xȳ + x̄y)) 
           = xy ⊕ z(x + y), 

which is four instructions. 
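
The four-instruction result can be checked bitwise in C against the sum-of-minterms definition. Both function names below are made up for the illustration.

```c
#include <stdint.h>

/* "Exactly two of x, y, z are 1" in each bit position, in the
   decomposed form g ^ (z & h) with g = x & y and h = x | y. */
uint32_t exactly_two(uint32_t x, uint32_t y, uint32_t z) {
    return (x & y) ^ (z & (x | y));
}

/* Reference version: the three minterms written out directly. */
uint32_t exactly_two_ref(uint32_t x, uint32_t y, uint32_t z) {
    return (x & y & ~z) | (x & ~y & z) | (~x & y & z);
}
```

Calling both with x = 0xF0, y = 0xCC, and z = 0xAA exercises all eight input combinations in the low-order byte; each returns 0x68.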


Clearly, the theorem can be extended to functions of four or 
more variables. That is, any Boolean function f(x_1, x_2, ..., x_n) 
can be decomposed into the form g(x_1, x_2, ..., x_{n−1}) ⊕ 
x_n h(x_1, x_2, ..., x_{n−1}). Thus, a function of four variables 
can be decomposed as follows: 


f(w, x, y, z) = g(w, x, y) ⊕ zh(w, x, y), where 
g(w, x, y) = g1(w, x) ⊕ yh1(w, x) and 
h(w, x, y) = g2(w, x) ⊕ yh2(w, x). 


This shows that a computer that has an instruction for each of 
the 16 binary Boolean functions can implement any function of 
four variables with ten instructions. Similarly, any function of 
five variables can be implemented with 22 instructions. 


However, it is possible to do much better. For functions of 
four or more variables there is probably no simple plug-in 
equation like the theorem gives, but exhaustive computer 
searches have been done. The results are that any Boolean 
function of four variables can be implemented with seven binary 
Boolean instructions, and any such function of five variables can 
be implemented with 12 such instructions [Knu4, 7.1.2]. 


In the case of five variables, only 1920 of the 2^(2^5) = 
4,294,967,296 functions require 12 instructions, and these 1920 
functions are all essentially the same function. The variations are 
obtained by permuting the arguments, replacing some 
arguments with their complements, or complementing the value 
of the function. 


2-23 Implementing Instructions for All 16 Binary 
Boolean Operations 


The instruction sets of some computers include all 16 binary 
Boolean operations. Many of the instructions are useless in that 
their function can be accomplished with another instruction. For 
example, the function f(x, y) = 0 simply clears a register, and 
most computers have a variety of ways to do that. Nevertheless, 
one reason a computer designer might choose to implement all 
16 is that there is a simple and quite regular circuit for doing it. 


Refer to Table 2-1 on page 17, which shows all 16 binary 
Boolean functions. To implement these functions as instructions, 
choose four of the opcode bits to be the same as the function 
values shown in the table. Denoting these opcode bits by c0, c1, 
c2, and c3, reading from the bottom up in the table, and the 
input registers by x and y, the circuit for implementing all 16 
binary Boolean operations is described by the logic expression 


c0·xy + c1·xȳ + c2·x̄y + c3·x̄ȳ. 


For example, with c0 = c1 = c2 = c3 = 0, the instruction 
computes the zero function, f(x, y) = 0. With c0 = 1 and the 
other opcode bits 0 it is the and instruction. With c0 = c3 = 0 
and c1 = c2 = 1 it is exclusive or, and so forth. 


This can be implemented with n 4:1 MUXs, where n is the 
word size of the machine. The data bits of x and y are the select 
lines, and the four opcode bits are the data inputs to each MUX. 
The MUX is a standard building block in today’s technology, and 
it is usually a very fast circuit. It is illustrated below. 


[Figure: a 4:1 MUX with data inputs c0, c1, c2, and c3, select 
lines x and y, and a single output.] 


The function of the circuit is to select c0, c1, c2, or c3 to be the 
output, depending on whether x and y are 00, 01, 10, or 11, 
respectively. It is like a four-position rotary switch. 
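
A software model of the circuit can be sketched as below. It follows the logic expression given earlier, so c0 is the bit selected when x = y = 1 and c3 the bit selected when x = y = 0; the function name bool_op and the packing of c0 through c3 into the low four bits of an integer are assumptions of this example.

```c
#include <stdint.h>

/* Apply, bitwise, the binary Boolean operation chosen by the
   4-bit opcode c, where bit i of c is the opcode bit c_i. */
uint32_t bool_op(unsigned c, uint32_t x, uint32_t y) {
    uint32_t c0 = 0u - (c & 1u);          /* widen each c_i to 0 or all 1's */
    uint32_t c1 = 0u - ((c >> 1) & 1u);
    uint32_t c2 = 0u - ((c >> 2) & 1u);
    uint32_t c3 = 0u - ((c >> 3) & 1u);
    return (c0 & x & y) | (c1 & x & ~y) | (c2 & ~x & y) | (c3 & ~x & ~y);
}
```

With this encoding, opcode 1 is and, 6 is exclusive or, 7 is or, and 8 is nor.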


Elegant as this is, it is somewhat expensive in opcode points, 
using 16 of them. There are a number of ways to implement all 
16 Boolean operations using only eight opcode points, at the 
expense of less regular logic. One such scheme is illustrated in 
Table 2-3. 


TABLE 2-3. EIGHT SUFFICIENT BOOLEAN INSTRUCTIONS 


Values   Formula               Mnemonic (Name) 
0001     xy                    and 
0010     xȳ                    andc (and with complement) 
0110     x ⊕ y                 xor (exclusive or) 
0111     x + y                 or 
1110     ¬(xy)                 nand (negative and) 
1101     ¬(xȳ), or x̄ + y       cor (complement and or) 
1001     ¬(x ⊕ y), or x ≡ y    eqv (equivalence) 
1000     ¬(x + y)              nor (negative or) 


The eight operations not shown in the table can be done with 
the eight instructions shown, by interchanging the inputs or by 
having both register fields of the instruction refer to the same 
register. See exercise 13. 


IBM’s POWER architecture uses this scheme, with the minor 
difference that POWER has or with complement rather than 
complement and or. The scheme shown in Table 2-3 allows the 
last four instructions to be implemented by complementing the 
result of the first four instructions, respectively. 


Historical Notes 


The algebra of logic expounded in George Boole’s An 
Investigation of the Laws of Thought (1854)7 is somewhat different 
from what we know today as “Boolean algebra.” Boole used the 
integers 1 and 0 to represent truth and falsity, respectively, and 
he showed how they could be manipulated with the methods of 
ordinary numerical algebra to formalize natural language 
statements involving “and,” “or,” and “except.” He also used 
ordinary algebra to formalize statements in set theory involving 
intersection, union of disjoint sets, and complementation. He 
also formalized statements in probability theory, in which the 
variables take on real number values from 0 to 1. The work 
often deals with questions of philosophy, religion, and law. 


Boole is regarded as a great thinker about logic because he 
formalized it, allowing complex statements to be manipulated 
mechanically and flawlessly with the familiar methods of 
ordinary algebra. 


Skipping ahead in history, there are a few programming 
languages that include all 16 Boolean operations. IBM’s PL/I (ca. 
1966) includes a built-in function named BOOL. In BOOL(x, y, 
z), z is a bit string of length four (or converted to that if 
necessary) and x and y are bit strings of equal length (or 
converted to that if necessary). Argument z specifies the Boolean 
operation to be performed on x and y. Binary 0000 is the zero 
function, 0001 is xy, 0010 is xȳ, and so forth. 


Another such language is Basic for the Wang System 2200B 
computer (ca. 1974), which provides a version of BOOL that 
operates on character strings rather than on bit strings or 
integers [Neum]. 


Still another such language is MIT PDP-6 Lisp, later called 
MacLisp [GLS1]. 

Exercises 


1. David de Kloet suggests the following code for the snoob 
function, for x ≠ 0, where the final assignment to y is the 
result: 

y ← x + (x & −x) 
x ← x & ¬y 
while((x & 1) = 0) x ← x >> 1 
x ← x >> 1 
y ← y | x 


This is essentially the same as Gosper’s code (page 15), 
except the right shift is done with a while-loop rather than 
with a divide instruction. Because division is usually 
costly in time, this might be competitive with Gosper’s 
code if the while-loop is not executed too many times. Let 
n be the length of the bit strings x and y, k the number of 
1-bits in the strings, and assume the code is executed for 
all values of x that have exactly k 1-bits. Then for each 
invocation of the function, how many times, on average, 
will the body of the while-loop be executed? 


2. The text mentions that a left shift by a variable amount is 
not right-to-left computable. Consider the function x << 
(x & 1) [Knu8]. This is a left shift by a variable amount, 
but it can be computed by 


x + (x & 1) * x, or 
x + (x & (−(x & 1))), 


which are all right-to-left computable operations. What is 
going on here? Can you think of another such function? 


3. Derive Dietz’s formula for the average of two unsigned 
integers, 


(x & y) + ((x ⊕ y) >>u 1). 

4. Give an overflow-free method for computing the average 
of four unsigned integers, ⌊(a + b + c + d)/4⌋. 


5. Many of the comparison predicates shown on page 23 can 
be simplified substantially if bit 31 of either x or y is 
known. Show how the seven-instruction expression for 
x ≤u y can be simplified to three basic RISC, 
non-comparison, instructions if y_31 = 0. 


6. Show that if two numbers, possibly distinct, are added 
with “end-around carry,” the addition of the carry bit 
cannot generate another carry out of the high-order 
position. 


7. Show how end-around carry can be used to do addition if 
negative numbers are represented in one's-complement 
notation. What is the maximum number of bit positions 
that a carry (from any bit position) might be propagated 
through? 


8. Show that the MUX operation, (x & m) | (y & ~m), can be 
done in three instructions on the basic RISC (which does 
not have the and with complement instruction). 


9. Show how to implement x ⊕ y in four instructions with 
and-or-not logic. 


10. Given a 32-bit word x and two integer variables i and j (in 
registers), show code to copy the bit of x at position i to 
position j. The values of i and j have no relation, but 
assume that 0 ≤ i, j ≤ 31. 


11. How many binary Boolean instructions are sufficient to 
evaluate any n-variable Boolean function if it is 
decomposed recursively by the method of the theorem? 


12. Show that alternative decompositions of Boolean 
functions of three variables are 


(a) f(x, y, z) = g(x, y) ⊕ z̄h(x, y) (the “negative Davio 
decomposition”), and 
(b) f(x, y, z) = g(x, y) ⊕ (z + h(x, y)). 


13. It is mentioned in the text that all 16 binary Boolean 
operations can be done with the eight instructions shown 
in Table 2-3, by interchanging the inputs or by having 
both register fields of the instruction refer to the same 
register. Show how to do this. 


14. Suppose you are not concerned about the six Boolean 
functions that are really constants or unary functions, 
namely f(x, y) = 0, 1, x, y, x̄, and ȳ, but you want your 
instruction set to compute the other ten functions with 
one instruction. Can this be done with fewer than eight 
binary Boolean instruction types (opcodes)? 


15. Exercise 13 shows that eight instruction types suffice to 
compute any of the 16 two-operand Boolean operations 
with one R-R (register-register) instruction. Show that six 
instruction types suffice in the case of R-I (register- 
immediate) instructions. With R-I instructions, the input 
operands cannot be interchanged or equated, but the 
second input operand (the immediate field) can be 
complemented or, in fact, set to any value at no cost in 
execution time. Assume for simplicity that the immediate 
fields are the same length as the general purpose 
registers. 


16. Show that not all Boolean functions of three variables can 
be implemented with three binary logical instructions. 


Chapter 3. Power-of-2 Boundaries 


3-1 Rounding Up/Down to a Multiple of a Known 
Power of 2 


Rounding an unsigned integer x down to, for example, the next 
smaller multiple of 8 is trivial: x & −8 does it. An alternative is 
(x >>u 3) << 3. These work for signed integers as well, provided 
“round down” means to round in the negative direction (e.g., 
(−37) & (−8) = −40). 

Rounding up is almost as easy. For example, an unsigned 
integer x can be rounded up to the next greater multiple of 8 
with either of 


(x + 7) & −8, or 
x + (−x & 7). 

These expressions are correct for signed integers as well, 
provided “round up” means to round in the positive direction. 
The second term of the second expression is useful if you want 
to know how much you must add to x to make it a multiple of 8 
[Gold]. 
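
In C these expressions might be written as follows for unsigned 32-bit values. The function names are illustrative, and ~7u is used in place of −8, which has the same bit pattern.

```c
#include <stdint.h>

/* Next smaller-or-equal multiple of 8. */
uint32_t round_down8(uint32_t x) { return x & ~7u; }

/* Next greater-or-equal multiple of 8 (wraps to 0 for x above
   2^32 - 8). */
uint32_t round_up8(uint32_t x) { return (x + 7u) & ~7u; }

/* How much must be added to x to reach a multiple of 8. */
uint32_t pad_to8(uint32_t x) { return (0u - x) & 7u; }
```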


To round a signed integer to the nearest multiple of 8 toward 
0, you can combine the two expressions above in an obvious 
way: 


t ← (x >>s 31) & 7 
(x + t) & −8 


An alternative for the first line is t ← (x >>s 2) >>u 29, which 
is useful if the machine lacks and immediate, or if the constant is 
too large for its immediate field. 


Sometimes the rounding factor is given as the log2 of the 
alignment amount (e.g., a value of 3 means to round to a 
multiple of 8). In this case, code such as the following can be 
used, where k = log2(alignment amount): 


round down:   x & ((−1) << k) 
              (x >>u k) << k 
round up:     t ← (1 << k) − 1;  (x + t) & ¬t 
              t ← (−1) << k;  (x − t − 1) & t 
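
A short C sketch of the first round-down and first round-up forms, assuming 32-bit unsigned values and 0 ≤ k ≤ 31 (the function names are made up for the example):

```c
#include <stdint.h>

/* Round x down to a multiple of 2^k. */
uint32_t round_down_pow2(uint32_t x, unsigned k) {
    return x & (UINT32_MAX << k);
}

/* Round x up to a multiple of 2^k (wraps modulo 2^32). */
uint32_t round_up_pow2(uint32_t x, unsigned k) {
    uint32_t t = (UINT32_C(1) << k) - 1u;
    return (x + t) & ~t;
}
```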
3-2 Rounding Up/Down to the Next Power of 2 


We define two functions that are similar to floor and ceiling, but 
which are directed roundings to the closest integral power of 2, 
rather than to the closest integer. Mathematically, they are 
defined by 


flp2(x) = undefined,     x < 0, 
          0,             x = 0, 
          2^⌊log2 x⌋,    otherwise; 

clp2(x) = undefined,     x < 0, 
          0,             x = 0, 
          2^⌈log2 x⌉,    otherwise. 


The initial letters of the function names are intended to suggest 
“floor” and “ceiling.” Thus, flp2(x) is the greatest power of 2 
that is ≤ x, and clp2(x) is the least power of 2 that is ≥ x. These 
definitions make sense even when x is not an integer (e.g., 
flp2(0.1) = 0.0625). The functions satisfy several relations 
analogous to those involving floor and ceiling, such as those 
shown below, where n is an integer. 


⌊x⌋ = ⌈x⌉ iff x is an integer     flp2(x) = clp2(x) iff x is a power of 2 or is 0 
⌊x + n⌋ = ⌊x⌋ + n                  flp2(2^n x) = 2^n flp2(x) 
⌈x⌉ = −⌊−x⌋                        clp2(x) = 1/flp2(1/x), x ≠ 0 

Computationally, we deal only with the case in which x is an 
integer, and we take it to be unsigned, so the functions are well 
defined for all x. We require the value computed to be the 
arithmetically correct value modulo 2^32 (that is, we take clp2(x) 
to be 0 for x > 2^31). The functions are tabulated below for a 
few values of x. 


x           flp2(x)   clp2(x) 
0           0         0 
1           1         1 
2           2         2 
3           2         4 
4           4         4 
5           4         8 
...         ...       ... 
2^31 − 1    2^30      2^31 
2^31        2^31      2^31 
2^31 + 1    2^31      0 
...         ...       ... 
2^32 − 1    2^31      0 


Functions flp2 and clp2 are connected by the relations shown 
below. These can be used to compute one from the other, subject 
to the indicated restrictions. 


clp2(x) = 2flp2(x − 1),         x ≠ 1, 
        = flp2(2x − 1),         1 ≤ x ≤ 2^31, 
flp2(x) = clp2(x ÷ 2 + 1),      x ≠ 0, 
        = clp2(x + 1) ÷ 2,      x < 2^31. 


The round-up and round-down functions can be computed 
quite easily with the number of leading zeros instruction, as 
shown below. However, for these relations to hold for x = 0 and 
x > 2^31, the computer must have its shift instructions defined to 
produce 0 for shift amounts of −1, 32, and 63. Many machines 
(e.g., PowerPC) have “mod-64” shifts, which do this. In the case 
of −1, it is adequate if the machine shifts in the opposite 
direction (that is, a shift left of −1 becomes a shift right of 1). 


flp2(x) = 1 << (31 − nlz(x)) 
        = 1 << (nlz(x) ⊕ 31) 
        = 0x80000000 >>u nlz(x) 

clp2(x) = 1 << (32 − nlz(x − 1)) 
        = 0x80000000 >>u (nlz(x − 1) − 1) 

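
In portable C the shift amounts −1 and 32 are undefined, so a sketch of these formulas has to test for the boundary cases explicitly instead of relying on mod-64 shifts. The slow loop below merely stands in for a number of leading zeros instruction, and the names flp2_nlz and clp2_nlz are invented for the example.

```c
#include <stdint.h>

/* Number of leading zeros by a simple loop; nlz(0) = 32. */
static int nlz(uint32_t x) {
    int n = 0;
    while (n < 32 && !(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}

/* flp2(x) = 0x80000000 >>u nlz(x), with x = 0 handled separately. */
uint32_t flp2_nlz(uint32_t x) {
    return x == 0 ? 0 : 0x80000000u >> nlz(x);
}

/* clp2(x) = 1 << (32 - nlz(x - 1)); a shift by 32 must give 0. */
uint32_t clp2_nlz(uint32_t x) {
    int n = nlz(x - 1);            /* x = 0 gives nlz(0xFFFFFFFF) = 0 */
    return n == 0 ? 0 : UINT32_C(1) << (32 - n);
}
```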
Rounding Down 


Figure 3-1 illustrates a branch-free algorithm that might be 
useful if number of leading zeros is not available. This algorithm 
is based on right-propagating the leftmost 1-bit, and executes in 
12 instructions. 


unsigned flp2(unsigned x) { 
   x = x | (x >> 1); 
   x = x | (x >> 2); 
   x = x | (x >> 4); 
   x = x | (x >> 8); 
   x = x | (x >> 16); 
   return x - (x >> 1); 
} 


FIGURE 3-1. Greatest power of 2 less than or equal to x, 
branch free. 


Figure 3-2 shows two simple loops that compute the same 
function. All variables are unsigned integers. The loop on the 
right keeps turning off the rightmost 1-bit of x until x = 0, and 
then returns the previous value of x. 


y = 0x80000000;          do { 
while (y > x)               y = x; 
   y = y >> 1;              x = x & (x - 1); 
return y;                } while(x != 0); 
                         return y; 


FIGURE 3-2. Greatest power of 2 less than or equal to x, 
simple loops. 


The loop on the left executes in 4nlz(x) + 3 instructions. The 
loop on the right, for x ≠ 0, executes in 4 pop(x) instructions,1 
if the comparison to 0 is zero-cost. 


Rounding Up 


The right-propagation trick yields a good algorithm for rounding 
up to the next power of 2. This algorithm, shown in Figure 3-3, 
is branch free and runs in 12 instructions. 


unsigned clp2(unsigned x) { 
   x = x - 1; 
   x = x | (x >> 1); 
   x = x | (x >> 2); 
   x = x | (x >> 4); 
   x = x | (x >> 8); 
   x = x | (x >> 16); 
   return x + 1; 
} 


FIGURE 3-3. Least power of 2 greater than or equal to x. 


An attempt to compute this with the obvious loop does not 
work out very well: 


y = 1; 
while (y < x)       // Unsigned comparison. 
   y = 2*y; 
return y; 


This code returns 1 for x = 0, which is probably not what you 
want, loops forever for x > 2^31, and executes in 4n + 3 
instructions, where n is the power of 2 of the returned integer. 
Thus, it is slower than the branch-free code, in terms of 
instructions executed, for n ≥ 3 (x ≥ 5). 


3-3 Detecting a Power-of-2 Boundary Crossing 


Assume memory is divided into blocks that are a power of 2 in 
size, starting at address 0. The blocks may be words, 
doublewords, pages, and so on. Then, given a starting address a 
and a length l, we wish to determine whether or not the address 
range from a to a + l − 1, l ≥ 2, crosses a block boundary. The 
quantities a and l are unsigned and any values that fit in a 
register are possible. 


If l = 0 or 1, a boundary crossing does not occur, regardless 
of a. If l exceeds the block size, a boundary crossing does occur, 
regardless of a. For very large values of l (wraparound is 
possible), a boundary crossing can occur even if the first and last 
bytes of the address range are in the same block. 


There is a surprisingly concise way to detect boundary 
crossings on the IBM System/370 [CJS]. This method is 
illustrated below for a block size of 4096 bytes (a common page 
size). 


O    RA,=A(-4096) 
ALR  RA,RL 
BO   CROSSES 


The first instruction forms the logical or of RA (which contains 
the starting address a) and the number 0xFFFFF000. The second 
instruction adds in the length and sets the machine's 2-bit 
condition code. For the add logical instruction, the first bit of the 
condition code is set to 1 if a carry occurred, and the second bit 
is set to 1 if the 32-bit register result is nonzero. The last 
instruction branches if both bits are set. At the branch target, RA 
will contain the length that extends beyond the first page (this is 
an extra feature that was not asked for). 


If, for example, a = 0 and l = 4096, a carry occurs, but the 
register result is 0, so the program properly does not branch to 
label CROSSES. 


Let us see how this method can be adapted to RISC machines, 
which generally do not have branch on carry and register result 
nonzero. Using a block size of 8 for notational simplicity, the 
method of [CJS] branches to CROSSES if a carry occurred 
((a | −8) + l ≥ 2^32) and the register result is nonzero 
((a | −8) + l ≠ 2^32). Thus, it is equivalent to the predicate 


(a | −8) + l > 2^32. 


This in turn is equivalent to getting a carry in the final addition 
in evaluating ((a | −8) − 1) + l. If the machine has branch on 
carry, this can be used directly, giving a solution in about five 
instructions, counting a load of the constant −8. 


If the machine does not have branch on carry, we can use the 
fact that carry occurs in x + y iff ¬x <u y (see “Unsigned Add/ 
Subtract” on page 31) to obtain the expression 


¬((a | −8) − 1) <u l. 

Using various identities such as ¬(x − 1) = −x gives the 
following equivalent expressions for the “boundary crossed” 
predicate: 


−(a | −8) <u l 
¬(a | −8) + 1 <u l 
(¬a & 7) + 1 <u l 


These can be evaluated in five or six instructions on most RISC 
computers, counting the final conditional branch. 


Using another tack, clearly an 8-byte boundary is crossed iff

$$(a \mathbin{\&} 7) + l - 1 \ge 8.$$

This cannot be directly evaluated because of the possibility of
overflow (which occurs if l is very large), but it is easily
rearranged to $8 - (a \mathbin{\&} 7) < l$, which can be directly evaluated
on the computer (no part of it overflows). This gives the
expression

$$8 - (a \mathbin{\&} 7) \overset{u}{<} l,$$

which can be evaluated in five instructions on most RISCs (four
if it has subtract from immediate). If a boundary crossing occurs,
the length that extends beyond the first block is given by
$l - (8 - (a \mathbin{\&} 7))$, which can be calculated with one additional
instruction (subtract).


This formula can be easily understood from the figure below
[Kumar], which illustrates that a & 7 is the offset of a in its
block, and thus 8 - (a & 7) is the space remaining in the block.
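The last predicate can be sketched in C as follows. This is only an illustration for a block size of 8 with 32-bit addresses; the function names crosses8, crosses8_alt, and excess8 are made up for this sketch and do not come from the text.

```c
#include <stdint.h>

/* Nonzero if the l-byte range starting at address a crosses an
   8-byte block boundary: 8 - (a & 7) <u l. No part can overflow. */
static int crosses8(uint32_t a, uint32_t l) {
    return 8u - (a & 7u) < l;
}

/* Equivalent form from the text: (~a & 7) + 1 <u l. */
static int crosses8_alt(uint32_t a, uint32_t l) {
    return (~a & 7u) + 1u < l;
}

/* Bytes that extend beyond the first block. Meaningful only when
   crosses8(a, l) is nonzero. */
static uint32_t excess8(uint32_t a, uint32_t l) {
    return l - (8u - (a & 7u));
}
```

For example, a range of 4 bytes starting at offset 5 in a block has 3 bytes of room left, so it crosses the boundary with 1 byte in the next block.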




Exercises 


1. Show how to round an unsigned integer to the nearest 
multiple of 8, with the halfway case (a) rounding up, (b) 
rounding down, and (c) rounding up or down, whichever 
makes the next bit to the left a zero (“unbiased” 
rounding). 


2. Show how to round an unsigned integer to the nearest 
multiple of 10, with the halfway case (a) rounding up, (b) 
rounding down, and (c) rounding up or down, whichever 
results in an even multiple of 10. Feel free to use division, 
remaindering, and multiplication instructions, and don't 
be concerned about values very close to the largest 
unsigned integer. 


3. Code a function in C that does an "unaligned load." The
function is given an address a and it loads the four bytes 
from addresses a through a + 3 into a 32-bit GPR, as if 
those four bytes contained an integer. Parameter a 
addresses the low-order byte (that is, the machine is 
little-endian). The function should be branch free, it 
should execute at most two load instructions and, if a is 
full-word aligned, it must not attempt to load from 
address a + 4, because that may be in a read-protected 
block. 


Chapter 4. Arithmetic Bounds 


4-1 Checking Bounds of Integers 


By "bounds checking" we mean to verify that an integer x is
within two bounds a and b; that is, that

$$a \le x \le b.$$

We first assume that all quantities are signed integers.


An important application is the checking of array indexes. For 
example, suppose a one-dimensional array A can be indexed by 
values from 1 to 10. Then, for a reference A(i), a compiler might 
generate code to check that 


$$1 \le i \le 10$$


and to branch or trap if this is not the case. In this section we 
show that this check can be done with a single comparison, by 
performing the equivalent check [PL8]: 


$$i - 1 \overset{u}{\le} 9.$$


This is probably better code, because it involves only one 
compare-branch (or compare-trap), and because the quantity i- 1 
is probably needed anyway for the array addressing calculations. 


Does the implementation

$$a \le x \le b \;\Rightarrow\; x - a \overset{u}{\le} b - a$$

always work, even if overflow may occur in the subtractions? It
does, provided we somehow know that $a \le b$. In the case of
array bounds checking, language rules may require that an array 
not have a number of elements (or number of elements along 
any axis) that are 0 or negative, and this rule can be verified at 
compile time or, for dynamic extents, at array allocation time. In 
such an environment, the transformation above is correct, as we 
will now show. 


It is convenient to use a lemma, which is good to know in its 
own right. 


LEMMA. If a and b are signed integers and $a \le b$, then the
computed value b - a correctly represents the arithmetic value
b - a, if the computed value is interpreted as unsigned.


Proof. (Assume a 32-bit machine.) Because $a \le b$, the true
difference b - a is in the range 0 to
$(2^{31} - 1) - (-2^{31}) = 2^{32} - 1$. If the true difference is in the
range 0 to $2^{31} - 1$, then the machine result is correct (because
the result is representable under signed interpretation), and the
sign bit is off. Hence the machine result is correct under either
signed or unsigned interpretation.


If the true difference is in the range $2^{31}$ to $2^{32} - 1$, then the
machine result will differ by some multiple of $2^{32}$ (because the
result is not representable under signed interpretation). This
brings the result (under signed interpretation) to the range
$-2^{31}$ to -1. The machine result is too low by $2^{32}$, and the sign
bit is on. Reinterpreting the result as unsigned increases it by
$2^{32}$, because the sign bit is given a weight of $+2^{31}$ rather than
$-2^{31}$. Hence the reinterpreted result is correct.


The "bounds theorem" is

THEOREM. If a and b are signed integers and $a \le b$, then

$$a \le x \le b \;\equiv\; x - a \overset{u}{\le} b - a. \tag{1}$$


Proof. We distinguish three cases, based on the value of x. In
all cases, by the lemma, since $a \le b$, the computed value b - a is
equal to the arithmetic value b - a if b - a is interpreted as
unsigned, as it is in Equation (1).


Case 1, x < a: In this case, x - a interpreted as unsigned is
$x - a + 2^{32}$. Whatever the values of x and b are (within the range
of 32-bit numbers),

$$x + 2^{32} > b.$$

Therefore

$$x - a + 2^{32} > b - a,$$

and hence

$$x - a \overset{u}{>} b - a.$$

In this case, both sides of Equation (1) are false.


Case 2, $a \le x \le b$: Then, arithmetically, $x - a \le b - a$.
Because $a \le x$, by the lemma the true difference x - a equals the
computed value x - a if the latter is interpreted as unsigned. Hence

$$x - a \overset{u}{\le} b - a;$$

that is, both sides of Equation (1) are true.


Case 3, x > b: Then x - a > b - a. Because in this case $x \ge a$
(because $b \ge a$), by the lemma the true difference x - a equals the
computed value x - a if the latter is interpreted as unsigned. Hence

$$x - a \overset{u}{>} b - a;$$

that is, both sides of Equation (1) are false.


The theorem stated above is also true if a and b are unsigned
integers. This is because for unsigned integers the lemma holds
trivially, and the above proof is also valid.


Below is a list of similar bounds-checking transformations, 
with the theorem above stated again. These all hold for either 
signed or unsigned interpretations of a, b, and x. 


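$$\text{if } a \le b \text{ then } a \le x \le b \;\equiv\; x - a \overset{u}{\le} b - a \;\equiv\; b - x \overset{u}{\le} b - a$$
$$\text{if } a \le b \text{ then } a \le x < b \;\equiv\; x - a \overset{u}{<} b - a$$
$$\text{if } a \le b \text{ then } a < x \le b \;\equiv\; b - x \overset{u}{<} b - a$$
$$\text{if } a < b \text{ then } a < x < b \;\equiv\; x - a - 1 \overset{u}{<} b - a - 1 \;\equiv\; b - x - 1 \overset{u}{<} b - a - 1$$

In the last rule, b - a - 1 can be replaced with $b + \lnot a$.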
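As an illustration of the first rule, the two-sided signed check can be written in C with a single unsigned comparison. This is a sketch under the assumption of 32-bit two's-complement integers; the name in_range is made up here, and the subtractions are done on unsigned operands because signed overflow is undefined in C.

```c
#include <stdint.h>

/* Nonzero iff a <= x <= b, given a <= b (signed).
   Implements x - a <=u b - a with wraparound arithmetic. */
static int in_range(int32_t x, int32_t a, int32_t b) {
    return (uint32_t)x - (uint32_t)a <= (uint32_t)b - (uint32_t)a;
}
```

The check stays correct even when b - a does not fit in a signed 32-bit integer, for example with a = INT32_MIN and b = INT32_MAX.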


There are some quite different transformations that may be 
useful when the test is of the form $-2^{n-1} \le x \le 2^{n-1} - 1$. This is
a test to see if a signed quantity x can be correctly represented 
as an n-bit two’s-complement integer. To illustrate with n = 8, 
the following tests are equivalent: 


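a. $-128 \le x \le 127$

b. $x + 128 \overset{u}{\le} 255$

c. $(x \overset{s}{\gg} 7) + 1 \overset{u}{\le} 1$

d. $x \overset{s}{\gg} 7 = x \overset{s}{\gg} 31$

e. $(x \overset{s}{\gg} 7) + (x \overset{u}{\gg} 31) = 0$

f. $(x \ll 24) \overset{s}{\gg} 24 = x$

g. $x \oplus (x \overset{s}{\gg} 31) \le 127$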


Equation (b) is simply an application of the preceding material 
in this section. Equation (c) is as well, after shifting x right seven 
positions. Equations (c) through (f) and possibly (g) are probably useful
only if the constants in Equations (a) and (b) exceed the size of 
the immediate fields of the computer's compare and add 
instructions. 
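Three of these tests are sketched below in C for n = 8. The function names are invented for this illustration. Tests (d) and (g) assume that >> applied to a negative int32_t is an arithmetic (sign-propagating) shift, which is implementation-defined in C but is what common compilers do.

```c
#include <stdint.h>

/* Test (b): x + 128 <=u 255. */
static int fits8_b(int32_t x) {
    return (uint32_t)x + 128u <= 255u;
}

/* Test (d): (x >>s 7) == (x >>s 31).
   Assumes arithmetic right shift of negative values. */
static int fits8_d(int32_t x) {
    return (x >> 7) == (x >> 31);
}

/* Test (g): x ^ (x >>s 31) <= 127, under the same assumption.
   The left side is x for x >= 0 and ~x for x < 0, so it is never negative. */
static int fits8_g(int32_t x) {
    return (x ^ (x >> 31)) <= 127;
}
```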


Another special case involving powers of 2 is

$$0 \le x \le 2^n - 1 \;\Leftrightarrow\; (x \overset{u}{\gg} n) = 0,$$

or, more generally,

$$a \le x \le a + 2^n - 1 \;\Leftrightarrow\; ((x - a) \overset{u}{\gg} n) = 0.$$


4-2 Propagating Bounds through Add's and Subtract's


Some optimizing compilers perform “range analysis” of 
expressions. This is the process of determining, for each 
occurrence of an expression in a program, upper and lower 
bounds on its value. Although this optimization is not a really 
big winner, it does permit improvements such as omitting the 
range check on a C “switch” statement and omitting some 
subscript bounds checks that compilers may provide as a 
debugging aid. 

Suppose we have bounds on two variables x and y as follows,
where all quantities are unsigned:

$$a \le x \le b, \text{ and} \quad c \le y \le d. \tag{3}$$

Then, how can we compute tight bounds on x + y, x - y, and
-x? Arithmetically, of course, $a + c \le x + y \le b + d$; but the
point is that the additions may overflow.


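The way to calculate the bounds is expressed in the
following:


THEOREM. If a, b, c, d, x, and y are unsigned integers and

$$a \overset{u}{\le} x \overset{u}{\le} b \quad \text{and} \quad c \overset{u}{\le} y \overset{u}{\le} d,$$

then

$$\begin{aligned}
0 \overset{u}{\le} x + y \overset{u}{\le} 2^{32} - 1 &\quad \text{if } a + c \le 2^{32} - 1 \text{ and } b + d \ge 2^{32}, \\
a + c \overset{u}{\le} x + y \overset{u}{\le} b + d &\quad \text{otherwise;}
\end{aligned} \tag{4}$$

$$\begin{aligned}
0 \overset{u}{\le} x - y \overset{u}{\le} 2^{32} - 1 &\quad \text{if } a - d < 0 \text{ and } b - c \ge 0, \\
a - d \overset{u}{\le} x - y \overset{u}{\le} b - c &\quad \text{otherwise;}
\end{aligned} \tag{5}$$

$$\begin{aligned}
0 \overset{u}{\le} -x \overset{u}{\le} 2^{32} - 1 &\quad \text{if } a = 0 \text{ and } b \ne 0, \\
-b \overset{u}{\le} -x \overset{u}{\le} -a &\quad \text{otherwise.}
\end{aligned} \tag{6}$$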

Inequalities (4) say that the bounds on x + y are “normally” 
a + c and b + d, but if the calculation of a + c does not 
overflow and the calculation of b + d does overflow, then the 
bounds are 0 and the maximum unsigned integer. Equations (5) 
are interpreted similarly, but the true result of a subtraction 
being less than 0 constitutes an overflow (in the negative
direction). 


Proof. If neither a + c nor b + d overflows, then x + y, with 
x and y in the indicated ranges, cannot overflow, making the 
computed results equal to the true results, so the second 
inequality of (4) holds. If both a + c and b + d overflow, then
so also does x + y. Now arithmetically, it is clear that

$$a + c - 2^{32} \le x + y - 2^{32} \le b + d - 2^{32}.$$

This is what is calculated when the three terms overflow. Hence,
in this case also,

$$a + c \overset{u}{\le} x + y \overset{u}{\le} b + d.$$

If a + c does not overflow, but b + d does, then

$$a + c \le 2^{32} - 1 \quad \text{and} \quad b + d \ge 2^{32}.$$

Because x + y takes on all values in the range a + c to b + d, it
takes on the values $2^{32} - 1$ and $2^{32}$; that is, the computed
value x + y takes on the values $2^{32} - 1$ and 0 (although it
doesn't take on all values in that range).


Lastly, the case that a + c overflows, but b + d does not, 
cannot occur, because $a \le b$ and $c \le d$.


This completes the proof of inequalities (4). The proof of (5) 
is similar, but “overflow” means that a true difference is less 
than 0. 


Inequalities (6) can be proved by using (5) with a = b = 0,
and then renaming the variables. (The expression -x with x an
unsigned number means to compute the value of $2^{32} - x$, or of
$\lnot x + 1$ if you prefer.)


Because unsigned overflow is so easy to recognize (see 
“Unsigned Add/Subtract” on page 31), these results are easily 
embodied in code, as shown in Figure 4-1, for addition and 
subtraction. The computed lower and upper limits are variables 
s and t, respectively.


s = a + c;                     s = a - d;
t = b + d;                     t = b - c;
if (s >= a && t < b) {         if (s > a && t <= b) {
   s = 0;                         s = 0;
   t = 0xFFFFFFFF;}               t = 0xFFFFFFFF;}


FIGURE 4-1. Propagating unsigned bounds through addition
and subtraction operations.
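The two fragments of Figure 4-1 can be wrapped into functions as in the following sketch. The struct urange and the function names are hypothetical, introduced only so the computed limits s and t can be returned together; 32-bit unsigned arithmetic is assumed.

```c
#include <stdint.h>

/* Hypothetical pair type for the computed lower and upper limits. */
typedef struct { uint32_t lo, hi; } urange;

/* Bounds on the computed x + y, given a <= x <= b and c <= y <= d. */
static urange add_bounds(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    urange r = { a + c, b + d };
    /* a + c did not wrap (lo >= a) but b + d did (hi < b). */
    if (r.lo >= a && r.hi < b) { r.lo = 0; r.hi = 0xFFFFFFFFu; }
    return r;
}

/* Bounds on the computed x - y, given a <= x <= b and c <= y <= d. */
static urange sub_bounds(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    urange r = { a - d, b - c };
    /* a - d borrowed (lo > a) but b - c did not (hi <= b). */
    if (r.lo > a && r.hi <= b) { r.lo = 0; r.hi = 0xFFFFFFFFu; }
    return r;
}
```

When both limits wrap, the wrapped values are returned unchanged, which matches the "otherwise" rows of inequalities (4) and (5).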


Signed Numbers 


The case of signed numbers is not so clean. As before, suppose
we have bounds on two variables x and y as follows, where all
quantities are signed:

$$a \le x \le b, \text{ and} \quad c \le y \le d.$$

We wish to compute tight bounds on x + y, x - y, and -x. The
reasoning is very similar to that for the case of unsigned
numbers, and the results for addition are shown below.


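$$\begin{aligned}
a + c < -2^{31},\; b + d < -2^{31} &:\quad a + c \le x + y \le b + d \\
a + c < -2^{31},\; b + d \ge -2^{31} &:\quad -2^{31} \le x + y \le 2^{31} - 1 \\
-2^{31} \le a + c < 2^{31},\; b + d < 2^{31} &:\quad a + c \le x + y \le b + d \\
-2^{31} \le a + c < 2^{31},\; b + d \ge 2^{31} &:\quad -2^{31} \le x + y \le 2^{31} - 1 \\
a + c \ge 2^{31},\; b + d \ge 2^{31} &:\quad a + c \le x + y \le b + d
\end{aligned} \tag{7}$$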


The first row means that if both of the additions a + c and b
+ d overflow in the negative direction, then the computed sum
x + y lies between the computed sums a + c and b + d. This is
because all three computed sums are too high by the same
amount ($2^{32}$). The second row means that if the addition a + c
overflows in the negative direction, and the addition b + d 
either does not overflow or overflows in the positive direction, 
then the computed sum x + y can take on the extreme negative 
number and the extreme positive number (although perhaps not 
all values in between), which is not difficult to show. The other 
rows are interpreted similarly. 


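The rules for propagating bounds on signed numbers through
the subtraction operation can easily be derived by rewriting the
bounds on y as

$$-d \le -y \le -c$$

and using the rules for addition. The results are shown below.

$$\begin{aligned}
a - d < -2^{31},\; b - c < -2^{31} &:\quad a - d \le x - y \le b - c \\
a - d < -2^{31},\; b - c \ge -2^{31} &:\quad -2^{31} \le x - y \le 2^{31} - 1 \\
-2^{31} \le a - d < 2^{31},\; b - c < 2^{31} &:\quad a - d \le x - y \le b - c \\
-2^{31} \le a - d < 2^{31},\; b - c \ge 2^{31} &:\quad -2^{31} \le x - y \le 2^{31} - 1 \\
a - d \ge 2^{31},\; b - c \ge 2^{31} &:\quad a - d \le x - y \le b - c
\end{aligned}$$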


The rules for negation can be derived from the rules for
subtraction by taking a = b = 0, omitting some impossible
combinations, simplifying, and renaming. The results are as
follows:

$$\begin{aligned}
a = -2^{31},\; b = -2^{31} &:\quad -x = -2^{31} \\
a = -2^{31},\; b \ne -2^{31} &:\quad -2^{31} \le -x \le 2^{31} - 1 \\
a \ne -2^{31} &:\quad -b \le -x \le -a
\end{aligned}$$


C code for the case of signed numbers is a bit messy. We will 
consider only addition. It seems to be simplest to check for the 
two cases in (7) in which the computed limits are the extreme 
negative and positive numbers. Overflow in the negative 
direction occurs if the two operands are negative and the sum is 
nonnegative (see “Signed Add/Subtract” on page 28). Thus, to 
check for the condition that $a + c < -2^{31}$, we could let s = a +
c; and then code something like “if (a < 0 && c < 0 && s >=
0) ....” It will be more efficient,1 however, to perform logical
operations directly on the arithmetic variables, with the sign bit
containing the true/false result of the logical operations. Then,
we write the above condition as “if ((a & c & ~s) < 0) ....”
These considerations lead to the program fragment shown in 
Figure 4-2. 


s = a + c;
t = b + d;
u = a & c & ~s & ~(b & d & ~t);
v = ((a ^ c) | ~(a ^ s)) & (~b & ~d & t);
if ((u | v) < 0) {
   s = 0x80000000;
   t = 0x7FFFFFFF;}


FIGURE 4-2. Propagating signed bounds through an addition 
operation. 


Here u is true (sign bit is 1) if the addition a + c overflows
in the negative direction, and the addition b + d does not
overflow in the negative direction. Variable v is true if the
addition a + c does not overflow and the addition b + d
overflows in the positive direction. The former condition can be
expressed as “a and c have different signs, or a and s have the
same sign.” The “if” test is equivalent to “if (u < 0 || v < 0)”;
that is, if either u or v is true.
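A self-contained version of the fragment in Figure 4-2 might look as follows. Because signed overflow is undefined in C, this sketch does the additions and logical operations on unsigned copies of the operands and tests the sign bit explicitly; the struct srange and the function name are hypothetical. Converting the results back to int32_t relies on the usual two's-complement wraparound, which is implementation-defined.

```c
#include <stdint.h>

/* Hypothetical pair type for the computed lower and upper limits. */
typedef struct { int32_t lo, hi; } srange;

/* Bounds on the computed x + y, given a <= x <= b and c <= y <= d (signed). */
static srange add_bounds_signed(int32_t a, int32_t b, int32_t c, int32_t d) {
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t uc = (uint32_t)c, ud = (uint32_t)d;
    uint32_t s = ua + uc;                      /* wrapped a + c */
    uint32_t t = ub + ud;                      /* wrapped b + d */
    /* u: a + c overflows negatively and b + d does not. */
    uint32_t u = ua & uc & ~s & ~(ub & ud & ~t);
    /* v: a + c does not overflow and b + d overflows positively. */
    uint32_t v = ((ua ^ uc) | ~(ua ^ s)) & (~ub & ~ud & t);
    srange r;
    if ((u | v) & 0x80000000u) {               /* sign bit of u | v */
        s = 0x80000000u;
        t = 0x7FFFFFFFu;
    }
    r.lo = (int32_t)s;
    r.hi = (int32_t)t;
    return r;
}
```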


4-3 Propagating Bounds through Logical Operations


As in the preceding section, suppose we have bounds on two
variables x and y as follows, where all quantities are unsigned:

$$a \le x \le b, \text{ and} \quad c \le y \le d. \tag{8}$$

Then what are some reasonably tight bounds on x | y, x & y,
$x \oplus y$, and $\lnot x$?


Combining inequalities (8) with some inequalities from
Section 2-3 on page 17, and noting that $\lnot x = 2^{32} - 1 - x$,
yields

$$\max(a, c) \le (x \mid y) \le b + d,$$
$$0 \le (x \mathbin{\&} y) \le \min(b, d),$$
$$0 \le (x \oplus y) \le b + d, \text{ and}$$
$$\lnot b \le \lnot x \le \lnot a,$$


where it is assumed that the addition b + d does not overflow. 
These are easy to compute and might be good enough for the 
compiler application mentioned in the preceding section; 
however, the bounds in the first two inequalities are not tight. 
For example, writing constants in binary, suppose

$$00010 \le x \le 00100, \text{ and} \quad 01001 \le y \le 10100. \tag{9}$$

Then, by inspection (e.g., trying all 36 possibilities for x and y),
we see that $01010 \le (x \mid y) \le 10111$. Thus, the lower bound is
not max(a, c), nor is it a | c, and the upper bound is not b + d,
nor is it b | d.


Given the values of a, b, c, and d in inequalities (8), how can
one obtain tight bounds on the logical expressions? Consider
first the minimum value attained by x | y. A reasonable guess
might be the value of this expression with x and y both at their
minima, that is, a | c. Example (9), however, shows that the
minimum can be lower than this.


To find the minimum, our procedure is to start with x = a
and y = c, and then find an amount by which to increase either
x or y so as to reduce the value of x | y. The result will be this 
reduced value. Rather than assigning a and c to x and y, we 
work directly with a and c, increasing one of them when doing 
so is valid and it reduces the value of a | c. 


The procedure is to scan the bits of a and c from left to right. 
If both bits are 0, the result will have a 0 in that position. If both
bits are 1, the result will have a 1 in that position (clearly, no
values of x and y could make the result less). In these cases,
continue the scan to the next bit position. If one scanned bit is 1
and the other is 0, then it is possible that changing the 0 to 1
and setting all the following bits in that bound's value to 0 will
reduce the value of a | c. This change will not increase the value
of a | c, because the result has a 1 in that position anyway, from
the other bound. Therefore, form the number with the 0
changed to 1 and subsequent bits changed to 0. If that is less
than or equal to the corresponding upper limit, the change can 
be made; do it, and the result is the or of the modified value 
with the other lower bound. If the change cannot be made 
(because the altered value exceeds the corresponding upper 
bound), continue the scan to the next bit position. 


That's all there is to it. It might seem that after making the 
change the scan should continue, looking for other opportunities 
to further reduce the value of a | c. However, even if a position 
is found that allows a 0 to be changed to 1, setting the
subsequent bits to 0 does not reduce the value of a | c, because
those bits are already 0. 


C code for this algorithm is shown in Figure 4-3. We assume
that the compiler will move the subexpressions ~a & c and a &
~c out of the loop. More significantly, if the number of leading
zeros instruction is available, the program can be speeded up by
initializing m with

    m = 0x80000000 >> nlz(a ^ c);


unsigned minOR(unsigned a, unsigned b,
               unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (~a & c & m) {
         temp = (a | m) & -m;
         if (temp <= b) {a = temp; break;}
      }
      else if (a & ~c & m) {
         temp = (c | m) & -m;
         if (temp <= d) {c = temp; break;}
      }
      m = m >> 1;
   }
   return a | c;
}


FIGURE 4-3. Minimum value of x | y with bounds on x and y.


This skips over initial bit positions in which a and c are both 0
or both 1. For this speedup to be effective when a ^ c is 0 (that
is, when a = c), the machine's shift right instruction should be
mod-64. If number of leading zeros is not available, it may be
worthwhile to use some version of the flp2 function (see page
60) with argument a ^ c.
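One way to gain confidence in the scan is to compare it with an exhaustive search over small ranges. The sketch below restates the function of Figure 4-3 with fixed-width types and adds two helpers, minOR_brute and minOR_agrees, which are invented for this illustration and are practical only for small bounds.

```c
#include <stdint.h>

/* The algorithm of Figure 4-3, restated with uint32_t. */
static uint32_t minOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m = 0x80000000u, temp;
    while (m != 0) {
        if (~a & c & m) {                 /* a has 0, c has 1 here */
            temp = (a | m) & -m;          /* set bit, clear lower bits */
            if (temp <= b) { a = temp; break; }
        } else if (a & ~c & m) {          /* a has 1, c has 0 here */
            temp = (c | m) & -m;
            if (temp <= d) { c = temp; break; }
        }
        m >>= 1;
    }
    return a | c;
}

/* Reference: try every x in [a, b] and y in [c, d] (small ranges only). */
static uint32_t minOR_brute(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t best = 0xFFFFFFFFu;
    for (uint32_t x = a; x <= b; x++)
        for (uint32_t y = c; y <= d; y++)
            if ((x | y) < best) best = x | y;
    return best;
}

/* 1 if the two agree for all 0 <= a <= b < n and 0 <= c <= d < n. */
static int minOR_agrees(uint32_t n) {
    for (uint32_t a = 0; a < n; a++)
        for (uint32_t b = a; b < n; b++)
            for (uint32_t c = 0; c < n; c++)
                for (uint32_t d = c; d < n; d++)
                    if (minOR(a, b, c, d) != minOR_brute(a, b, c, d))
                        return 0;
    return 1;
}
```

For the bounds of example (9), minOR(2, 4, 9, 20) gives 10 (01010 in binary): the scan finds that c can be raised from 01001 to 01010, which clears its low-order 1-bit.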


Now let us consider the maximum value attained by x | y, with
the variables bounded as shown in inequalities (8). The
algorithm is similar to that for the minimum, except it scans the
values of bounds b and d (from left to right), looking for a
position in which both bits are 1. If such a position is found, the
algorithm tries to increase the value of b | d by decreasing one of
the bounds by changing the 1 to 0, and setting all subsequent
bits in that bound to 1. If this is acceptable (if the resulting value
is greater than or equal to the corresponding lower bound), the
change is made and the result is the value of b | d using the
modified bound. If the change cannot be done, it is attempted on
the other bound. If the change cannot be done to either bound,
the scan continues. C code for this algorithm is shown in Figure
4-4. Here the subexpression b & d can be moved out of the
loop, and the algorithm can be speeded up by initializing m with

    m = 0x80000000 >> nlz(b & d);


unsigned maxOR(unsigned a, unsigned b,
               unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (b & d & m) {
         temp = (b - m) | (m - 1);
         if (temp >= a) {b = temp; break;}
         temp = (d - m) | (m - 1);
         if (temp >= c) {d = temp; break;}
      }
      m = m >> 1;
   }
   return b | d;
}


FIGURE 4-4. Maximum value of x | y with bounds on x and y.


There are two ways in which we might propagate the bounds
of inequalities (8) through the expression x & y: algebraic and
direct computation. The algebraic method uses DeMorgan's rule:

$$x \mathbin{\&} y = \lnot(\lnot x \mid \lnot y).$$

Because we know how to propagate bounds precisely through or,
and it is trivial to propagate them through not
($a \le x \le b \Leftrightarrow \lnot b \le \lnot x \le \lnot a$), we have

$$\mathrm{minAND}(a, b, c, d) = \lnot\,\mathrm{maxOR}(\lnot b, \lnot a, \lnot d, \lnot c),$$

and

$$\mathrm{maxAND}(a, b, c, d) = \lnot\,\mathrm{minOR}(\lnot b, \lnot a, \lnot d, \lnot c).$$
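The algebraic method translates directly into C. The sketch below restates minOR and maxOR (Figures 4-3 and 4-4) with fixed-width types so that it stands alone; the names minAND_alg and maxAND_alg are made up here to distinguish them from the direct versions.

```c
#include <stdint.h>

static uint32_t minOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m = 0x80000000u, temp;
    while (m != 0) {
        if (~a & c & m) {
            temp = (a | m) & -m;
            if (temp <= b) { a = temp; break; }
        } else if (a & ~c & m) {
            temp = (c | m) & -m;
            if (temp <= d) { c = temp; break; }
        }
        m >>= 1;
    }
    return a | c;
}

static uint32_t maxOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m = 0x80000000u, temp;
    while (m != 0) {
        if (b & d & m) {
            temp = (b - m) | (m - 1);
            if (temp >= a) { b = temp; break; }
            temp = (d - m) | (m - 1);
            if (temp >= c) { d = temp; break; }
        }
        m >>= 1;
    }
    return b | d;
}

/* DeMorgan: complement the bounds (which swaps lower and upper),
   propagate through or, and complement the result. */
static uint32_t minAND_alg(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    return ~maxOR(~b, ~a, ~d, ~c);
}

static uint32_t maxAND_alg(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    return ~minOR(~b, ~a, ~d, ~c);
}
```

With the bounds of example (9), x & y ranges from 0 (for instance x = 00010, y = 01001) to 4 (x = 00100, y = 01100).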


For the direct computation method, the code is very similar
to that for propagating bounds through or. It is shown in Figures
4-5 and 4-6.


unsigned minAND(unsigned a, unsigned b,
                unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (~a & ~c & m) {
         temp = (a | m) & -m;
         if (temp <= b) {a = temp; break;}
         temp = (c | m) & -m;
         if (temp <= d) {c = temp; break;}
      }
      m = m >> 1;
   }
   return a & c;
}


FIGURE 4-5. Minimum value of x & y with bounds on x and y.


unsigned maxAND(unsigned a, unsigned b,
                unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (b & ~d & m) {
         temp = (b & ~m) | (m - 1);
         if (temp >= a) {b = temp; break;}
      }
      else if (~b & d & m) {
         temp = (d & ~m) | (m - 1);
         if (temp >= c) {d = temp; break;}
      }
      m = m >> 1;
   }
   return b & d;
}


FIGURE 4-6. Maximum value of x & y with bounds on x and y.


The algebraic method of finding bounds on expressions in
terms of the functions for and, or, and not works for all the
binary logical expressions except exclusive or and equivalence.
The reason these two present a difficulty is that when expressed
in terms of and, or, and not, there are two terms containing x
and y. For example, we are to find

$$\min_{\substack{a \le x \le b \\ c \le y \le d}} (x \oplus y) = \min_{\substack{a \le x \le b \\ c \le y \le d}} ((x \mathbin{\&} \lnot y) \mid (\lnot x \mathbin{\&} y)).$$

The two operands of the or cannot be separately minimized
(without proof that it works, which actually it does), because we
seek one value of x and one value of y that minimizes the whole
or expression.


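The following expressions can be used to propagate bounds
through exclusive or:

$$\mathrm{minXOR}(a, b, c, d) = \mathrm{minAND}(a, b, \lnot d, \lnot c) \mid \mathrm{minAND}(\lnot b, \lnot a, c, d),$$
$$\mathrm{maxXOR}(a, b, c, d) = \mathrm{maxOR}(0, \mathrm{maxAND}(a, b, \lnot d, \lnot c),\; 0, \mathrm{maxAND}(\lnot b, \lnot a, c, d)).$$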

It is straightforward to evaluate the minXOR and maxXOR
functions by direct computation. The code for minXOR is the
same as that for minOR (Figure 4-3) except with the two break
statements removed, and the return value changed to a ^ c. The
code for maxXOR is the same as that for maxOR (Figure 4-4)
except with the four lines under the if clause replaced with

temp = (b - m) | (m - 1);
if (temp >= a) b = temp;
else {
   temp = (d - m) | (m - 1);
   if (temp >= c) d = temp;
}

and the return value changed to b ^ d.
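Written out in full, the two modified functions might look like this. It is a sketch that applies the changes just described to the code of Figures 4-3 and 4-4, using fixed-width types.

```c
#include <stdint.h>

/* minOR without the break statements, returning a ^ c. */
static uint32_t minXOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m = 0x80000000u, temp;
    while (m != 0) {
        if (~a & c & m) {
            temp = (a | m) & -m;
            if (temp <= b) a = temp;
        } else if (a & ~c & m) {
            temp = (c | m) & -m;
            if (temp <= d) c = temp;
        }
        m >>= 1;
    }
    return a ^ c;
}

/* maxOR with the body of the if clause replaced, returning b ^ d. */
static uint32_t maxXOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m = 0x80000000u, temp;
    while (m != 0) {
        if (b & d & m) {
            temp = (b - m) | (m - 1);
            if (temp >= a) b = temp;
            else {
                temp = (d - m) | (m - 1);
                if (temp >= c) d = temp;
            }
        }
        m >>= 1;
    }
    return b ^ d;
}
```

For the bounds of example (9), the exclusive or ranges from 8 (x = 00010, y = 01010) to 23 (x = 00011, y = 10100).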


Signed Bounds 


If the bounds are signed integers, propagating them through
logical expressions is substantially more complicated. The
calculation is irregular if 0 is within the range a to b, or c to d.
One way to calculate the lower and upper bounds for the
expression x | y is shown in Table 4-1. A “+” entry means that
the bound at the top of the column is greater than or equal to 0,
and a “-” entry means that it is less than 0. The column labeled
“minOR (signed)” contains expressions for computing the lower
bound of x | y, and the last column contains expressions for
computing the upper bound of x | y. One way to program this is
to construct a value ranging from 0 to 15 from the sign bits of a,
b, c, and d, and use a “switch” statement. Notice that not all
values from 0 to 15 are used, because it is impossible to have
a > b or c > d.


TABLE 4-1. SIGNED MINOR AND MAXOR FROM UNSIGNED

a  b  c  d   minOR (signed)                maxOR (signed)
-  -  -  -   minOR(a, b, c, d)             maxOR(a, b, c, d)
-  -  -  +   a                             -1
-  -  +  +   minOR(a, b, c, d)             maxOR(a, b, c, d)
-  +  -  -   c                             -1
-  +  -  +   min(a, c)                     maxOR(0, b, 0, d)
-  +  +  +   minOR(a, 0xFFFFFFFF, c, d)    maxOR(0, b, c, d)
+  +  -  -   minOR(a, b, c, d)             maxOR(a, b, c, d)
+  +  -  +   minOR(a, b, c, 0xFFFFFFFF)    maxOR(a, b, 0, d)
+  +  +  +   minOR(a, b, c, d)             maxOR(a, b, c, d)
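The switch-based approach suggested above could be coded along the following lines. This is a sketch, not the author's code: the struct srange, the function signed_or_bounds, and the encoding of the sign bits into a key (a in bit 3 down to d in bit 0, with 1 meaning negative) are choices made for this illustration. The unsigned minOR and maxOR of Figures 4-3 and 4-4 are restated so the block stands alone, and conversions back to int32_t assume two's-complement wraparound.

```c
#include <stdint.h>

static uint32_t minOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m = 0x80000000u, temp;
    while (m != 0) {
        if (~a & c & m) {
            temp = (a | m) & -m;
            if (temp <= b) { a = temp; break; }
        } else if (a & ~c & m) {
            temp = (c | m) & -m;
            if (temp <= d) { c = temp; break; }
        }
        m >>= 1;
    }
    return a | c;
}

static uint32_t maxOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m = 0x80000000u, temp;
    while (m != 0) {
        if (b & d & m) {
            temp = (b - m) | (m - 1);
            if (temp >= a) { b = temp; break; }
            temp = (d - m) | (m - 1);
            if (temp >= c) { d = temp; break; }
        }
        m >>= 1;
    }
    return b | d;
}

typedef struct { int32_t lo, hi; } srange;   /* hypothetical result type */

/* Bounds on x | y for signed a <= x <= b and c <= y <= d (Table 4-1). */
static srange signed_or_bounds(int32_t a, int32_t b, int32_t c, int32_t d) {
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t uc = (uint32_t)c, ud = (uint32_t)d;
    int key = ((a < 0) << 3) | ((b < 0) << 2) | ((c < 0) << 1) | (d < 0);
    srange r;
    switch (key) {
    case 0xE:                                 /* - - - + */
        r.lo = a; r.hi = -1; break;
    case 0xB:                                 /* - + - - */
        r.lo = c; r.hi = -1; break;
    case 0xA:                                 /* - + - + */
        r.lo = a < c ? a : c;
        r.hi = (int32_t)maxOR(0, ub, 0, ud); break;
    case 0x8:                                 /* - + + + */
        r.lo = (int32_t)minOR(ua, 0xFFFFFFFFu, uc, ud);
        r.hi = (int32_t)maxOR(0, ub, uc, ud); break;
    case 0x2:                                 /* + + - + */
        r.lo = (int32_t)minOR(ua, ub, uc, 0xFFFFFFFFu);
        r.hi = (int32_t)maxOR(ua, ub, 0, ud); break;
    default:                                  /* ----, --++, ++--, ++++ */
        r.lo = (int32_t)minOR(ua, ub, uc, ud);
        r.hi = (int32_t)maxOR(ua, ub, uc, ud); break;
    }
    return r;
}
```

The default case also catches the sign patterns that cannot occur when a ≤ b and c ≤ d; a production version would probably reject those inputs instead.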


For signed numbers, the relation

$$a \le x \le b \;\Leftrightarrow\; \lnot b \le \lnot x \le \lnot a$$

holds, so the algebraic method can be used to extend the results
of Table 4-1 to other logical expressions (except for exclusive or
and equivalence). We leave this and similar extensions to others.


Exercises 


1. For unsigned integers, what are the bounds on x - y if
$0 \le x \le b$ and $0 \le y \le d$?


2. Show how the maxOR function (Figure 4-4) can be
simplified if either a = 0 or c = 0 on a machine that
has the number of leading zeros instruction.


Chapter 5. Counting Bits 


5-1 Counting 1-Bits 


The IBM Stretch computer (ca. 1960) had a means of counting 
the number of 1-bits in a word, as well as the number of leading 
0's. It produced these two quantities as a by-product of all
logical operations! The former function is sometimes called 
population count (e.g., on Stretch and the SPARCv9). 


For machines that don't have this instruction, a good way to 
count the number of 1-bits is to first set each 2-bit field equal to 
the sum of the two single bits that were originally in the field, 
and then sum adjacent 2-bit fields, putting the results in each 4- 
bit field, and so on. A more complete discussion of this trick is in 
[RND]. The method is illustrated in Figure 5-1, in which the 
first row shows a computer word whose 1-bits are to be 
summed, and the last row shows the result (23 decimal). 


10 11 11 00 01 10 00 11 01 11 11 10 11 11 11 11

01 10 10 00 01 01 00 10 01 10 10 01 10 10 10 10

0011 0010 0010 0010 0011 0011 0100 0100

00000101 00000100 00000110 00001000

0000000000001001 0000000000001110

00000000000000000000000000010111


FIGURE 5-1. Counting 1-bits, "divide and conquer" strategy.


This is an example of the "divide and conquer" strategy, in
which the original problem (summing 32 bits) is divided into
two problems (summing 16 bits), which are solved separately,
and the results are combined (added, in this case). The strategy
is applied recursively, breaking the 16-bit fields into 8-bit fields,
and so on.


In the case at hand, the ultimate small problems (summing 
adjacent bits) can all be done in parallel, and combining 
adjacent sums can also be done in parallel in a fixed number of 
steps at each stage. The result is an algorithm that can be 
executed in $\log_2(32) = 5$ steps.


Other examples of divide and conquer are the well-known 
techniques of binary search, a sorting method known as 
quicksort, and a method for reversing the bits of a word, 
discussed on page 129. 


The method illustrated in Figure 5-1 can be committed to C 
code as 


x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF);


The first line uses (x >> 1) & 0x55555555 rather than the 
perhaps more natural (x & 0xAAAAAAAA) >> 1, because the code
shown avoids generating two large constants in a register. This 
would cost an instruction if the machine lacks the and not 
instruction. A similar remark applies to the other lines. 


Clearly, the last and is unnecessary, and other and's can be 
omitted when there is no danger that a field's sum will carry 
over into the adjacent field. Furthermore, there is a way to code 
the first line that uses one fewer instruction. This leads to the 
simplification shown in Figure 5-2, which executes in 21 
instructions and is branch-free. 


int pop(unsigned x) {
   x = x - ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   x = (x + (x >> 4)) & 0x0F0F0F0F;
   x = x + (x >> 8);
   x = x + (x >> 16);
   return x & 0x0000003F;
}


FIGURE 5-2. Counting 1-bits in a word.
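The straightforward five-line version and the simplified version of Figure 5-2 can be checked against each other and against a bit-by-bit loop, as in the sketch below. The names pop_steps, pop_short, and pop_naive are made up for this comparison; 0xBC637EFF is the word shown in the first row of Figure 5-1.

```c
#include <stdint.h>

/* The unsimplified divide-and-conquer version. */
static int pop_steps(uint32_t x) {
    x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);
    x = (x & 0x00FF00FFu) + ((x >> 8) & 0x00FF00FFu);
    x = (x & 0x0000FFFFu) + ((x >> 16) & 0x0000FFFFu);
    return (int)x;
}

/* The simplified version of Figure 5-2. */
static int pop_short(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x0000003Fu);
}

/* Reference: examine one bit at a time. */
static int pop_naive(uint32_t x) {
    int n = 0;
    for (; x != 0; x >>= 1)
        n += (int)(x & 1u);
    return n;
}
```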


The first assignment to x is based on the first two terms of
the rather surprising formula

$$\mathrm{pop}(x) = x - \left\lfloor \frac{x}{2} \right\rfloor - \left\lfloor \frac{x}{4} \right\rfloor - \cdots - \left\lfloor \frac{x}{2^{31}} \right\rfloor. \tag{1}$$

In Equation (1), we must have $x \ge 0$. By treating x as an
unsigned integer, Equation (1) can be implemented with a
sequence of 31 shift right immediate's of 1, and 31 subtract's. The
procedure of Figure 5-2 uses the first two terms of this on each
2-bit field, in parallel.

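There is a simple proof of Equation (1), which is shown
below for the case of a four-bit word. Let the word be
$b_3 b_2 b_1 b_0$, where each $b_i$ = 0 or 1. Then,

$$\begin{aligned}
x - \left\lfloor \frac{x}{2} \right\rfloor - \left\lfloor \frac{x}{4} \right\rfloor - \left\lfloor \frac{x}{8} \right\rfloor
&= b_3 \cdot 2^3 + b_2 \cdot 2^2 + b_1 \cdot 2^1 + b_0 \cdot 2^0 \\
&\quad - (b_3 \cdot 2^2 + b_2 \cdot 2^1 + b_1 \cdot 2^0) \\
&\quad - (b_3 \cdot 2^1 + b_2 \cdot 2^0) \\
&\quad - (b_3 \cdot 2^0) \\
&= b_3(2^3 - 2^2 - 2^1 - 2^0) + b_2(2^2 - 2^1 - 2^0) + b_1(2^1 - 2^0) + b_0(2^0) \\
&= b_3 + b_2 + b_1 + b_0.
\end{aligned}$$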
Alternatively, Equation (1) can be derived by noting that bit i
of the binary representation of a nonnegative integer x is given
by

$$b_i = \left\lfloor \frac{x}{2^i} \right\rfloor - 2 \left\lfloor \frac{x}{2^{i+1}} \right\rfloor,$$

and summing this for i = 0 to 31. Work it out; the last term is 0
because $x < 2^{32}$. Equation (1) generalizes to other bases. For
base ten it is

$$\mathrm{sum\_digits}(x) = x - 9 \left\lfloor \frac{x}{10} \right\rfloor - 9 \left\lfloor \frac{x}{100} \right\rfloor - \cdots,$$

where the terms are carried out until they are 0. This can be
proved by essentially the same technique used above.
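Both formulas are easy to try out in C. The sketch below evaluates Equation (1) and its base-ten analogue term by term; the function names are invented for this illustration, and the loops simply stop when the shifted or divided value reaches 0, since all further terms vanish.

```c
#include <stdint.h>

/* Equation (1): pop(x) = x - floor(x/2) - floor(x/4) - ... */
static uint32_t pop_eq1(uint32_t x) {
    uint32_t sum = x;
    while ((x >>= 1) != 0)      /* x is now floor(original / 2^i) */
        sum -= x;
    return sum;
}

/* Base-ten analogue: x - 9*floor(x/10) - 9*floor(x/100) - ... */
static uint32_t sum_digits(uint32_t x) {
    uint32_t sum = x;
    while ((x /= 10) != 0)      /* x is now floor(original / 10^i) */
        sum -= 9 * x;
    return sum;
}
```

For example, sum_digits(1234) computes 1234 - 9·123 - 9·12 - 9·1 = 10, which is 1 + 2 + 3 + 4.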


A variation of the above algorithm is to use a base 4 analogue 
of Equation (1) as a substitute for the second executable line of 
Figure 5-2: 


x = x - 3*((x >> 2) & 0x33333333)


This code, however, uses the same number of instructions as the 
line it replaces (six), and requires a fast multiply-by-3 instruction. 


An algorithm in HAKMEM memo [HAK, item 169] counts the 
number of 1-bits in a word by using the first three terms of (1) 
to produce a word of 3-bit fields, each of which contains the 
number of 1-bits that were in it. It then adds adjacent 3-bit fields 
to form 6-bit field sums, and then adds the 6-bit fields by 
computing the value of the word modulo 63. Expressed in C, the 
algorithm is (the long constants are in octal) 


int pop(unsigned x) {
   unsigned n;

   n = (x >> 1) & 033333333333;        // Count bits in
   x = x - n;                          // each 3-bit
   n = (n >> 1) & 033333333333;        // field.
   x = x - n;
   x = (x + (x >> 3)) & 030707070707;  // 6-bit sums.
   return x%63;                        // Add 6-bit sums.
}


The last line uses the unsigned modulus function. (It could be
either signed or unsigned if the word length were a multiple of
3.) That the modulus function sums the 6-bit fields becomes
clear by regarding the word x as an integer written in base 64.
The remainder upon dividing a base b integer by b - 1 is, for
$b \ge 3$, congruent mod b - 1 to the sum of the digits and, of
course, is less than b - 1. Because the sum of the digits in this
case must be less than or equal to 32, mod(x, 63) must be equal
to the sum of the digits of x, which is to say equal to the number
of 1-bits in the original x.


This algorithm requires only ten instructions on the DEC 
PDP-10, because that machine has an instruction for computing 
the remainder with its second operand directly referencing a 
fullword in memory. On a basic RISC, it requires about 13 
instructions, assuming the machine has unsigned modulus as one 
instruction (but not directly referencing a fullword immediate or 
memory operand). It is probably not very fast, because division 
is almost always a slow operation. Also, it doesn't apply to 64- 
bit word lengths by simply extending the constants, although it 
does work for word lengths up to 62. 


The return statement in the code above can be replaced with 
the following, which runs faster on most machines, but is 
perhaps less elegant (octal notation again). 


return ((x * 0404040404) >> 26) +   // Add 6-bit sums.
        (x >> 30);
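The two ways of adding the 6-bit sums can be put side by side as in the sketch below. The helper hak_fields and the names pop_hak_mod and pop_hak_mul are made up for this illustration; the code assumes 32-bit words, so that the multiplication wraps modulo 2^32.

```c
#include <stdint.h>

/* Shared first part: 3-bit field counts, then sums in 6-bit fields
   (the topmost field is only 2 bits wide). Constants are octal. */
static uint32_t hak_fields(uint32_t x) {
    uint32_t n;
    n = (x >> 1) & 033333333333u;
    x = x - n;
    n = (n >> 1) & 033333333333u;
    x = x - n;
    return (x + (x >> 3)) & 030707070707u;
}

/* Add the fields with the unsigned remainder, as in the HAKMEM code. */
static int pop_hak_mod(uint32_t x) {
    return (int)(hak_fields(x) % 63u);
}

/* Add the five low fields with a multiply; the top 2-bit field is
   added separately because the multiply shifts it out of the word. */
static int pop_hak_mul(uint32_t x) {
    x = hak_fields(x);
    return (int)(((x * 0404040404u) >> 26) + (x >> 30));
}
```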


A variation on the HAKMEM algorithm is to use Equation (1) 
to count the number of 1’s in each 4-bit field, working on all 
eight 4-bit fields in parallel [Hay1]. Then, the 4-bit sums can be 
converted to 8-bit sums in a straightforward way, and the four 
bytes can be added with a multiplication by 0x01010101. This 
gives 


int pop(unsigned x) {
   unsigned n;

   n = (x >> 1) & 0x77777777;         // Count bits in
   x = x - n;                         // each 4-bit
   n = (n >> 1) & 0x77777777;         // field.
   x = x - n;
   n = (n >> 1) & 0x77777777;
   x = x - n;
   x = (x + (x >> 4)) & 0x0F0F0F0F;   // Get byte sums.
   x = x*0x01010101;                  // Add the bytes.
   return x >> 24;
}


This is 19 instructions on the basic RISC. It works well if the 
machine is two-address, because the first six lines can be done
with only one move register instruction. Also, the repeated use of
the mask 0x77777777 permits loading it into a register and 
referencing it with register-to-register instructions. Furthermore, 
most of the shifts are of only one position. 


A quite different bit-counting method, illustrated in Figure 5-3,
is to turn off the rightmost 1-bit repeatedly [Weg, RND], until
the result is 0. It is very fast if the number of 1-bits is small,
taking 2 + 5pop(x) instructions.


int pop(unsigned x) {
   int n;

   n = 0;
   while (x != 0) {
      n = n + 1;
      x = x & (x - 1);
   }
   return n;
}


FIGURE 5-3. Counting 1-bits in a sparsely populated word. 


This has a dual algorithm that is applicable if the number of 
1-bits is expected to be large. The dual algorithm keeps turning 
on the rightmost 0-bit with x = x | (x + 1), until the result is
all 1’s (-1). Then, it returns 32 - n. (Alternatively, the original 
number x can be complemented, or n can be initialized to 32 
and counted down.) 
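
The dual algorithm can be sketched as follows. This is only an illustration under stated assumptions: the name pop_dense is hypothetical, and uint32_t is used to pin the word size at 32 bits.

```c
#include <stdint.h>

/* Dual of the sparse loop: turn on the rightmost 0-bit until the word
   is all 1's. n counts the 0-bits, so the population count is 32 - n.
   The name pop_dense is hypothetical. */
int pop_dense(uint32_t x) {
    int n = 0;
    while (x != 0xFFFFFFFFu) {
        n = n + 1;
        x = x | (x + 1);          /* Turn on the rightmost 0-bit. */
    }
    return 32 - n;
}
```

The loop runs once per 0-bit, so it is fastest when x is nearly all 1's.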


A rather amazing algorithm is to rotate x left one position, 31 
times, adding the 32 terms [MM]. The sum is the negative of 
pop(x)! That is, 


$$\mathrm{pop}(x) = -\sum_{i=0}^{31} \left(x \overset{\mathrm{rot}}{\ll} i\right), \qquad (2)$$
where the additions are done modulo the word size, and the 
final sum is interpreted as a two’s-complement integer. This is 
just a novelty; it would not be useful on most machines, because 
the loop is executed 31 times and thus it requires 63 
instructions, plus the loop-control overhead. 


To see why Equation (2) works, consider what happens to a 
single 1-bit of x. It gets rotated to all positions, and when these 
32 numbers are added, a word of all 1-bits results. This is -1. To 
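illustrate, consider a 6-bit word size and x = 001001 (binary):


001001   x
010010   x rotated left 1
100100   x rotated left 2
001001   x rotated left 3
010010   x rotated left 4
100100   x rotated left 5


Each of the two 1-bits of x contributes 111111 (that is, -1), so the
sum modulo 2^6 is 111110, which is -2 = -pop(x).
Of course, rotate-right would work just as well.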


The method of Equation (1) is very similar to this “rotate and 
sum" method, which becomes clear by rewriting (1) as 


$$\mathrm{pop}(x) = x - \sum_{i=1}^{31} \left(x \overset{u}{\gg} i\right).$$
This gives a slightly better algorithm than Equation (2) provides. 
It is better because it uses shift right, which is more commonly 
available than rotate, and because the loop can be terminated 
when the shifted quantity becomes 0. This reduces the loop- 
control code and may save a few iterations. The two algorithms 
are contrasted in Figure 5-4. 


int pop(unsigned x) {
   int i, sum;

   // Rotate and sum method       // Shift right & subtract

   sum = x;                       // sum = x;
   for (i = 1; i <= 31; i++) {    // while (x != 0) {
      x = rotatel(x, 1);          //    x = x >> 1;
      sum = sum + x;              //    sum = sum - x;
   }                              // }
   return -sum;                   // return sum;
}


FIGURE 5-4. Two similar bit-counting algorithms. 
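
For experimentation, the two columns of Figure 5-4 can be separated into two functions. The sketch below assumes 32-bit words; rotatel is written out as a hypothetical helper because C has no rotate operator, and the function names are illustrative only.

```c
#include <stdint.h>

/* Hypothetical helper: rotate a 32-bit word left by n, for 0 < n < 32. */
static uint32_t rotatel(uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

/* Rotate and sum: add all 32 rotations of x, then negate. */
int pop_rotate_sum(uint32_t x) {
    uint32_t sum = x;
    int i;
    for (i = 1; i <= 31; i++) {
        x = rotatel(x, 1);
        sum = sum + x;            /* Wraps modulo 2^32. */
    }
    return (int)(0u - sum);       /* Negate modulo 2^32. */
}

/* Shift right and subtract: stops as soon as the shifted word is 0. */
int pop_shift_subtract(uint32_t x) {
    uint32_t sum = x;
    while (x != 0) {
        x = x >> 1;
        sum = sum - x;
    }
    return (int)sum;
}
```

Unsigned arithmetic is used for the running sum so that the wraparound is well defined in C.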


A less interesting algorithm that may be competitive with all 
the algorithms for pop(x) in this section is to have a table that 
contains pop(x) for, say, x in the range 0 to 255. The table can 
be accessed four times, adding the four numbers obtained. A 
branch-free version of the algorithm looks like this: 


int pop(unsigned x) {             // Table lookup.
   static char table[256] = {
      0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
      ...
      4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};

   return table[x         & 0xFF] +
          table[(x >>  8) & 0xFF] +
          table[(x >> 16) & 0xFF] +
          table[(x >> 24)];
}
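
Rather than typing all 256 entries, the table can be filled at startup. The following is a minimal sketch, assuming 32-bit words; the names pop_table, init_pop_table, and pop_lookup are hypothetical, and the recurrence pop(i) = (i & 1) + pop(i >> 1) is used to build the entries.

```c
#include <stdint.h>

static unsigned char pop_table[256];   /* Hypothetical name. */

/* Fill the table using pop(i) = (i & 1) + pop(i >> 1). */
void init_pop_table(void) {
    int i;
    pop_table[0] = 0;
    for (i = 1; i < 256; i++)
        pop_table[i] = (unsigned char)((i & 1) + pop_table[i >> 1]);
}

/* Branch-free lookup: one table access per byte of x.
   init_pop_table() must have been called first. */
int pop_lookup(uint32_t x) {
    return pop_table[x & 0xFF] +
           pop_table[(x >> 8) & 0xFF] +
           pop_table[(x >> 16) & 0xFF] +
           pop_table[x >> 24];
}
```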


Item 167 in [HAK] contains a short algorithm for counting 
the number of 1-bits in a 9-bit quantity that is right-adjusted and 
isolated in a register. It works only on machines with registers of 
36 or more bits. Below is a version of that algorithm that works 
on 32-bit machines, but only for 8-bit quantities. 


   x = x * 0x08040201;    // Make 4 copies.
   x = x >> 3;            // So next step hits proper bits.
   x = x & 0x11111111;    // Every 4th bit.
   x = x * 0x11111111;    // Sum the digits (each 0 or 1).
   x = x >> 28;           // Position the result.


A version for 7-bit quantities is 


   x = x * 0x02040810;    // Make 4 copies, left-adjusted.
   x = x & 0x11111111;    // Every 4th bit.
   x = x * 0x11111111;    // Sum the digits (each 0 or 1).
   x = x >> 28;           // Position the result.


In these, the last two steps can be replaced with steps to 
compute the remainder of x modulo 15. 
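
Wrapped as functions, the two fragments and the modulo-15 variation might look like the sketch below. The function names are hypothetical, the arguments are assumed to be already isolated (0 to 255, or 0 to 127 for the 7-bit version), and uint32_t is assumed so that the multiplications wrap modulo 2^32.

```c
#include <stdint.h>

/* Population count of an 8-bit quantity (x must be in 0..255). */
int pop8(uint32_t x) {
    x = x * 0x08040201u;          /* Make 4 copies. */
    x = x >> 3;                   /* So next step hits proper bits. */
    x = x & 0x11111111u;          /* Every 4th bit. */
    x = x * 0x11111111u;          /* Sum the digits (each 0 or 1). */
    return (int)(x >> 28);        /* Position the result. */
}

/* Population count of a 7-bit quantity (x must be in 0..127). */
int pop7(uint32_t x) {
    x = x * 0x02040810u;          /* Make 4 copies, left-adjusted. */
    x = x & 0x11111111u;          /* Every 4th bit. */
    x = x * 0x11111111u;          /* Sum the digits (each 0 or 1). */
    return (int)(x >> 28);        /* Position the result. */
}

/* Variation: replace the last two steps with a remainder modulo 15.
   The hex digits are each 0 or 1 and their sum is at most 8. */
int pop8_mod15(uint32_t x) {
    x = x * 0x08040201u;
    x = x >> 3;
    x = x & 0x11111111u;
    return (int)(x % 15u);
}
```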


These are not particularly good; most programmers would 
probably prefer to use table lookup. The latter algorithm above, 
however, has a version that uses 64-bit arithmetic, which might 
be useful for a 64-bit machine that has fast multiplication. Its 
argument is a 15-bit quantity. (I don't believe there is a similar 
algorithm that deals with 16-bit quantities, unless it is known 
that not all 16 bits are 1.) The data type long long is a C 
extension found in many C compilers, old and new, for 64-bit 
integers. It is made official in the C99 standard. The suffix ULL
makes unsigned long long constants.


int pop(unsigned x) {
   unsigned long long y;

   y = x * 0x0002000400080010ULL;
   y = y & 0x1111111111111111ULL;
   y = y * 0x1111111111111111ULL;
   y = y >> 60;
   return y;
}


Sum and Difference of Population Counts of Two Words 


To compute pop(x) + pop(y) (if your computer does not have
the population count instruction), some time can be saved by 
using the first two lines of Figure 5-2 on x and y separately, 
adding x and y, and then executing the last three stages of the 
algorithm on the sum. After the first two lines of Figure 5-2 are 
executed, x and y consist of eight 4-bit fields, each containing a 
maximum value of 4. Thus, x and y can safely be added, because 
the maximum value in any 4-bit field of the sum would be 8, so 
no overflow occurs. (In fact, three words can be combined in 
this way.) 
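
A sketch of the sum computation follows. It assumes 32-bit words and that the first two lines of Figure 5-2 are the usual 2-bit and 4-bit field reductions; the name popSum is hypothetical.

```c
#include <stdint.h>

/* pop(x) + pop(y), sharing the last three stages between the two words. */
int popSum(uint32_t x, uint32_t y) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    y = y - ((y >> 1) & 0x55555555u);
    y = (y & 0x33333333u) + ((y >> 2) & 0x33333333u);
    x = x + y;                    /* Each 4-bit field is at most 8. */
    x = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x0000007Fu);   /* Result is at most 64. */
}
```

The two 4-bit halves of each byte are masked separately before adding, because their sum can reach 16 and would not fit in a 4-bit field.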


This idea also applies to subtraction. To compute pop(x) -
pop(y), use


$$\begin{aligned} \mathrm{pop}(x) - \mathrm{pop}(y) &= \mathrm{pop}(x) - (32 - \mathrm{pop}(\bar{y})) \\ &= \mathrm{pop}(x) + \mathrm{pop}(\bar{y}) - 32, \end{aligned}$$


where $\bar{y}$ is the bitwise complement of y. Then, use the technique
just described to compute pop(x) + pop($\bar{y}$).
The code is shown in Figure 5-5. It uses 32 instructions,
versus 43 for two applications of the code in Figure 5-2 followed
by a subtraction.


int popDiff(unsigned x, unsigned y) {
   x = x - ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   y = ~y;
   y = y - ((y >> 1) & 0x55555555);
   y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
   x = x + y;
   x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
   x = x + (x >> 8);
   x = x + (x >> 16);
   return (x & 0x0000007F) - 32;
}


FIGURE 5-5. Computing pop(x) - pop(y). 


Comparing the Population Counts of Two Words 


Sometimes one wants to know which of two words has the 
larger population count without regard to the actual counts. Can 
this be determined without doing a population count of the two 
words? Computing the difference of two population counts as in 
Figure 5-5, and comparing the result to 0 is one way, but there 
is another way that is preferable if either the population counts 
are expected to be low or if there is a strong correlation between 
the particular bits that are set in the two words. 


The idea is to clear a single bit in each word until one of the 
words is all zero; the other word then has the larger population 
count. The process runs faster in its worst and average cases if 
the bits that are 1 at the same positions in each word are first 
cleared. The code is shown in Figure 5-6. The procedure returns 
a negative integer if pop(x) < pop(y), 0 if pop(x) = pop(y), and 
a positive integer (1) if pop(x) > pop(y). 


int popCmpr(unsigned xp, unsigned yp) {
   unsigned x, y;

   x = xp & ~yp;                // Clear bits where
   y = yp & ~xp;                // both are 1.
   while (1) {
      if (x == 0) return y | -y;
      if (y == 0) return 1;
      x = x & (x - 1);          // Clear one bit
      y = y & (y - 1);          // from each.
   }
}


FIGURE 5-6. Comparing pop(x) with pop(y). 


After clearing the common 1-bits in each 32-bit word, the 
maximum possible number of 1-bits in both words together is 
32. Therefore, the word with the smaller number of 1-bits can 
have at most 16. Thus, the loop in Figure 5-6 is executed a 
maximum of 16 times, which gives a worst case of 119 
instructions executed on the basic RISC (16 · 7 + 7). A
simulation using uniformly distributed random 32-bit integers 
showed that the average population count of the word with the 
smaller population count is approximately 6.186, after clearing 
the common 1-bits. This gives an average execution time of 
about 50 instructions executed for random 32-bit inputs, not as 
good as using Figure 5-5. For this procedure to beat that of 
Figure 5-5, the number of 1-bits in either x or y, after clearing 
the common 1-bits, would have to be three or less. 


Counting the 1-bits in an Array 


The simplest way to count the number of 1-bits in an array 
(vector) of fullwords, in the absence of the population count 
instruction, is to use a procedure such as that of Figure 5-2 on 
page 82 on each word of the array and simply add the results. 
We call this the “naive” method. Ignoring loop control, the 
generation of constants, and loads from the array, it takes 16 
instructions per word: 15 for the code of Figure 5-2, plus one for 
the addition. We assume the procedure is expanded in line, the 
masks are loaded outside the loop, and the machine has a 
sufficient number of registers to hold all the quantities used in 
the calculation. 


Another way is to use the first two executable lines of Figure 
5-2 on groups of three words in the array, adding the three 
partial results. Because each partial result has a maximum value 
of 4 in each four-bit field, the sum of the three has a maximum
value of 12 in each four-bit field, so no overflow occurs. This
idea can be applied to the 8- and 16-bit fields. Coding and
compiling this method indicates that it gives about a 20%
reduction over the naive method in total number of instructions 
executed on the basic RISC. Much of the savings are cancelled by 
the additional housekeeping instructions required. We will not 
dwell on this method because there is a much better way to do it. 


The better way seems to have been invented by Robert 
Harley and David Seal in about 1996 [Seal1]. It is based on a
circuit called a carry-save adder (CSA), or 3:2 compressor. A CSA
is simply a sequence of independent full adders [H&P], and it is
often used in binary multiplier circuits. 


In Boolean algebra notation, the logic for each full adder is 


$$\begin{aligned} h &\leftarrow ab + ac + bc = ab + (a + b)c = ab + (a \oplus b)c, \\ l &\leftarrow (a \oplus b) \oplus c, \end{aligned}$$


where a, b, and c are the 1-bit inputs, l is the low-bit output
(sum) and h is the high-bit output (carry). Changing a + b on
the first line to a ⊕ b is justified because when a and b are both
1, the term ab makes the value of the whole expression 1. By
first assigning a ⊕ b to a temporary, the full adder logic can be
evaluated in five logical instructions, each operating on 32 bits
in parallel (on a 32-bit machine). We will refer to these five
instructions as CSA(h, l, a, b, c). This is a “macro,” with h and l
being outputs.
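
To make the five-operation count concrete, the CSA step can be written as a small function. This is a sketch only: the function form and the name csa are assumptions made for illustration (the text treats CSA as a macro), and uint32_t is assumed for the 32-bit word.

```c
#include <stdint.h>

/* Carry-save adder on 32 independent bit columns.
   *h receives the carries (weight 2), *l the sums (weight 1).
   Five logical operations: xor, and, and, or, xor. */
void csa(uint32_t *h, uint32_t *l, uint32_t a, uint32_t b, uint32_t c) {
    uint32_t u = a ^ b;           /* Temporary a xor b. */
    *h = (a & b) | (u & c);       /* Majority of a, b, c. */
    *l = u ^ c;                   /* Parity of a, b, c. */
}
```

For each bit position, 2*h + l equals the number of 1's among the corresponding bits of a, b, and c.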


One way to use the CSA operation is to process elements of 
the array A in groups of three, reducing each group of three 
words to two, and applying the population count operation to 
these two words. In the loop, these two population counts are 
summed. After executing the loop, the total population count of 
the array is twice the accumulated population count of the CSA's 
high-bit outputs, plus the accumulated population count of the 
low-bit outputs. 


Let $n_c$ be the number of instructions required for the CSA
steps and $n_p$ be the number of instructions required to do the
population count of one word. On a typical RISC machine $n_c = 5$
and $n_p = 15$. Ignoring loads from the array and loop control
(the code for which may vary quite a bit from one machine to
another), the loop discussed above takes $(n_c + 2n_p + 2)/3 \approx 12.33$
instructions per word of the array (the “+2” is for the two
additions in the loop). This is in contrast to the 16 instructions 
per word required by the naive method. 


There is another way to use the CSA operation that results in
a program that's more efficient and slightly more compact. This
is shown in Figure 5-7. It takes $(n_c + n_p + 1)/2 = 10.5$
instructions per word (ignoring loop control and loads).


#define CSA(h,l, a,b,c) \
   {unsigned u = a ^ b; unsigned v = c; \
      h = (a & b) | (u & v); l = u ^ v;}

int popArray(unsigned A[], int n) {

   int tot, i;
   unsigned ones, twos;

   tot = 0;                     // Initialize.
   ones = 0;
   for (i = 0; i <= n - 2; i = i + 2) {
      CSA(twos, ones, ones, A[i], A[i+1])
      tot = tot + pop(twos);
   }
   tot = 2*tot + pop(ones);

   if (n & 1)                   // If there's a last one,
      tot = tot + pop(A[i]);    // add it in.

   return tot;
}


FIGURE 5-7. Array population count, processing elements in
groups of two.


In this code, the CSA operation expands into


   u = ones ^ A[i];
   v = A[i+1];
   twos = (ones & A[i]) | (u & v);
   ones = u ^ v;


The code relies on the compiler to common the loads.


There are ways to use the CSA operation to further reduce the
number of instructions required to compute the population
count of an array. They are most easily understood by means of 
a circuit diagram. For example, Figure 5-8 illustrates a way to 
code a loop that takes array elements eight at a time and 
compresses them into four quantities, labeled eights, fours, twos, 
and ones. The fours, twos, and ones are fed back into the CSAs on 
the next loop iteration, and the 1-bits in eights are counted by an 
execution of the word-level population count function, and this 
count is accumulated. When all of the array has been processed, 
the total population count is 


8pop(eights) + 4pop(fours) + 2pop(twos) + pop(ones). 




FIGURE 5-8. A circuit for the array population count. 


The code is shown in Figure 5-9, which uses the CSA macro 
defined in Figure 5-7. The numbering of the CSA blocks in 
Figure 5-8 corresponds to the order of the CSA macro calls in 
Figure 5-9. The execution time of the loop, exclusive of array 
loads and loop control, is $(7n_c + n_p + 1)/8 = 6.375$
instructions per word of the array. 


int popArray(unsigned A[], int n) {

   int tot, i;
   unsigned ones, twos, twosA, twosB,
      fours, foursA, foursB, eights;

   tot = 0;                     // Initialize.
   fours = twos = ones = 0;

   for (i = 0; i <= n - 8; i = i + 8) {
      CSA(twosA, ones, ones, A[i], A[i+1])
      CSA(twosB, ones, ones, A[i+2], A[i+3])
      CSA(foursA, twos, twos, twosA, twosB)
      CSA(twosA, ones, ones, A[i+4], A[i+5])
      CSA(twosB, ones, ones, A[i+6], A[i+7])
      CSA(foursB, twos, twos, twosA, twosB)
      CSA(eights, fours, fours, foursA, foursB)
      tot = tot + pop(eights);
   }
   tot = 8*tot + 4*pop(fours) + 2*pop(twos) + pop(ones);

   for (i = i; i < n; i++)      // Simply add in the last
      tot = tot + pop(A[i]);    // 0 to 7 elements.
   return tot;
}


FIGURE 5-9. Array population count, processing elements in 
groups of eight. 
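
A self-contained version that can be compiled and checked against the naive method is sketched below. It assumes 32-bit words (uint32_t); the word-level pop shown is one conventional divide-and-conquer form, and the names popArrayNaive and popArray8 are hypothetical. The CSA macro is the one from the text with its arguments parenthesized.

```c
#include <stdint.h>

/* Word-level population count (a conventional divide-and-conquer form). */
static int pop(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x3Fu);
}

#define CSA(h,l, a,b,c) \
   {uint32_t u = (a) ^ (b); uint32_t v = (c); \
    (h) = ((a) & (b)) | (u & v); (l) = u ^ v;}

/* Naive method: one word-level count per array element. */
int popArrayNaive(const uint32_t A[], int n) {
    int tot = 0, i;
    for (i = 0; i < n; i++)
        tot = tot + pop(A[i]);
    return tot;
}

/* CSA method, eight elements per loop iteration. */
int popArray8(const uint32_t A[], int n) {
    int tot = 0, i;
    uint32_t ones = 0, twos = 0, fours = 0;
    uint32_t twosA, twosB, foursA, foursB, eights;

    for (i = 0; i <= n - 8; i = i + 8) {
        CSA(twosA, ones, ones, A[i], A[i+1])
        CSA(twosB, ones, ones, A[i+2], A[i+3])
        CSA(foursA, twos, twos, twosA, twosB)
        CSA(twosA, ones, ones, A[i+4], A[i+5])
        CSA(twosB, ones, ones, A[i+6], A[i+7])
        CSA(foursB, twos, twos, twosA, twosB)
        CSA(eights, fours, fours, foursA, foursB)
        tot = tot + pop(eights);
    }
    tot = 8*tot + 4*pop(fours) + 2*pop(twos) + pop(ones);

    for (; i < n; i++)            /* Add in the last 0 to 7 elements. */
        tot = tot + pop(A[i]);
    return tot;
}
```

Comparing popArray8 with popArrayNaive on arrays whose length is not a multiple of eight exercises both the main loop and the cleanup loop.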


The CSAs can be connected in many arrangements other than 
that shown in Figure 5-8. For example, increased parallelism 
might result from feeding the first three array elements into one 
CSA, and the next three into a second CSA, which allows the 
instructions of these two CSAs to execute in parallel. One might 
also be able to permute the three input operands of the CSA 
macros for increased parallelism. With the plan shown in Figure 
5-8, one can easily see how to use only the first three CSAs to 
construct a program that processes array elements in groups of 
four, and also how to expand it to construct programs that 
process array elements in groups of 16 or more. The plan shown 
also spreads out the loads somewhat, which would be 
advantageous for a machine that has a relatively low limit on 
the number of loads that can be outstanding at any one time. 


The plan of Figure 5-8 can be generalized so that very few
word population counts are done. To sketch how this program
might be constructed, it needs an array of m×2 words to hold
two of each of the variables we have called ones, twos, fours, and
so forth. For an array of size n, choosing $m = \lfloor \log_2(n + 1) \rfloor + 1$
is sufficient (m = 31 is sufficient for any size array that can be
held in a machine with a 32-bit byte-addressed space). A byte
array of size m is also needed to keep track of how many (0, 1,
or 2) values are currently in each row of the m×2 array. The
program processes array elements in groups of two. For each
group, the CSA is invoked to compress those two array elements
with a saved value of ones, which is most conveniently kept in
the [0,0] position of the m×2 array. In an inner loop, the
resulting twos is saved in the array, by scanning down (usually
not far at all) to find a row with fewer than two items. If the
twos row is full, its two values are combined with twos (using the
CSA). The twos output is put in the array, resetting its row count
to 1. The scan continues with the fours output to find a place to
put it, and so forth.


After completing the pass over the input array, the program 
next makes a pass over the (much shorter) m×2 array,
compressing all full rows, so that all rows contain only one 
significant value. Lastly, the program invokes the word-level 
population count operation on the first element of each row 
until a row with a zero count is encountered, computing the 
total array population count as 


pop(row 0) + 2pop(row 1) + 4pop(row 2) + .... 


The value suggested above for m ensures that the last row will 
have a zero count, which can be used to terminate the scans. 


The resulting program executes exactly $\lceil \log_2(n + 3) \rceil$ word
population counts. Unfortunately it is not practical, because the 
housekeeping steps for loading from and storing into the 
intermediate result arrays outweigh the computational 
instructions that are saved. An experimental program (without 
trying too hard to optimize it) ran in about 29 instructions per 
array word (counting all instructions in the loop). This is 
significantly worse than the naive method. 


Table 5-1 summarizes the number of instructions executed by 
this plan for various group sizes. The values in the middle two 
columns ignore loads and loop control. The fourth column gives
the total loop instruction execution count, per word of the input
array, produced by a compiler for the basic RISC machine 
(which does not have indexed loads). 


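TABLE 5-1. INSTRUCTIONS PER WORD FOR THE ARRAY
POPULATION COUNT


                  Instructions Exclusive of       All Instructions in Loop
                  Loads and Loop Control          (compiler output)

...
Groups of 32      $(31n_c + n_p + 1)/32$
Groups of $2^n$   $n_c + (n_p - n_c + 1)/2^n$

[The remaining rows and the numeric entries of this table did not survive extraction.]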


For small arrays, there are better plans than that of Figure 5-8.
For example, for an array of seven words, the plan of Figure
5-10 is quite efficient [Seal1]. It executes in $4n_c + 3n_p + 4 = 69$
instructions, or 9.86 instructions per word. Similar plans exist
that apply to arrays of size $2^k - 1$ words for any positive integer
k. The plan for 15 words executes in $11n_c + 4n_p + 6 = 121$
instructions, or 8.07 instructions per word.




FIGURE 5-10. A circuit for the total population count of 
seven words. 


Applications 


An application of the population count function is in computing 
the “Hamming distance” between two bit vectors, a concept
from the theory of error-correcting codes. The Hamming
distance is simply the number of places where the vectors differ;
that is,


dist(x, y) = pop(x ⊕ y).


See, for example, the chapter on error-correcting codes in 
[Dewd]. 
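
In C, the Hamming distance of two 32-bit words is a one-liner on top of any word-level population count. The sketch below assumes 32-bit words; the name hamming_dist is hypothetical, and the pop shown is one conventional implementation.

```c
#include <stdint.h>

/* Word-level population count (a conventional divide-and-conquer form). */
static int pop(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x3Fu);
}

/* Number of bit positions in which x and y differ. */
int hamming_dist(uint32_t x, uint32_t y) {
    return pop(x ^ y);
}
```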


Another application is to allow reasonably fast direct-indexed 
access to a moderately sparse array A that is represented in a 
certain compact way. In the compact representation, only the 
defined, or nonzero, elements of the array are stored. There is an 
auxiliary bit string array bits of 32-bit words, which has a 1-bit 
for each index i for which A[i] is defined. As a speedup device, 
there is also an array of words bitsum such that bitsum[j] is the 
total number of 1-bits in all the words of bits that precede entry 
j. This is illustrated below for an array in which elements 0, 2, 
32, 47, 48, and 95 are defined. 


bits          bitsum   data

0x00000005    0        A[0]
0x00018001    2        A[2]
0x80000000    5        A[32]
                       A[47]
                       A[48]
                       A[95]


Given an index i, 0 ≤ i ≤ 95, the corresponding index
sparse_i into the data array is given by the number of 1-bits in
array bits that precede the bit corresponding to i. This can be
calculated as follows:


   j = i >> 5;                  // j = i/32.
   k = i & 31;                  // k = rem(i, 32);
   mask = 1 << k;               // A "1" at position k.
   if ((bits[j] & mask) == 0) goto no_such_element;
   mask = mask - 1;             // 1's to right of k.
   sparse_i = bitsum[j] + pop(bits[j] & mask);


The cost of this representation is two bits per element of the full 
array. 
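
The lookup can be packaged as a function and tried on the example with elements 0, 2, 32, 47, 48, and 95 defined. This is a sketch under stated assumptions: the name sparse_index and the convention of returning -1 for an undefined element are hypothetical, and the pop shown is one conventional implementation.

```c
#include <stdint.h>

/* Word-level population count (a conventional divide-and-conquer form). */
static int pop(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x3Fu);
}

/* Returns the index into the compact data array for element i,
   or -1 if element i is not defined (hypothetical convention). */
int sparse_index(const uint32_t bits[], const uint32_t bitsum[], unsigned i) {
    unsigned j = i >> 5;                  /* j = i/32. */
    unsigned k = i & 31;                  /* k = rem(i, 32). */
    uint32_t mask = (uint32_t)1 << k;     /* A "1" at position k. */
    if ((bits[j] & mask) == 0) return -1;
    mask = mask - 1;                      /* 1's to right of k. */
    return (int)bitsum[j] + pop(bits[j] & mask);
}
```

With bits = {0x00000005, 0x00018001, 0x80000000} and bitsum = {0, 2, 5}, the defined indexes map to positions 0 through 5 of the data array in order.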


The population function can be used to generate binomially 
distributed random integers. To generate an integer drawn from 
a population given by BINOMIAL(t, p) where t is the number of
trials and p = 1/2, generate t random bits and count the number
of 1's in the t bits. This can be generalized to probabilities p
other than 1/2; see for example [Knu2, sec. 3.4.1, prob. 27]. 


Still another application of the population function is in
computing the number of trailing 0’s in a word (see “Counting
Trailing 0’s” on page 107).


According to computer folklore, the population count 
function is important to the National Security Agency. No one 
(outside of NSA) seems to know just what they use it for, but it 
may be in cryptography work or in searching huge amounts of 
material. 


5-2 Parity 


The “parity” of a string refers to whether it contains an odd or
an even number of 1-bits. The string has “odd parity” if it
contains an odd number of 1-bits; otherwise, it has “even
parity.”


Computing the Parity of a Word 


Here we mean to produce a 1 if a word x has odd parity, and a 0 
if it has even parity. This is the sum, modulo 2, of the bits of x— 
that is, the exclusive or of all the bits of x. 


One way to compute this is to compute pop(x); the parity is 
the rightmost bit of the result. This is fine if you have the 
population count instruction, but if not, there are better ways 
than using the code for pop(x). 


A rather direct method is to compute 


$$y \leftarrow \bigoplus_{i=0}^{n-1} \left(x \gg i\right),$$
where n is the word size, and then the parity of x is given by the
rightmost bit of y. (Here ⊕ denotes exclusive or, but for this
formula ordinary addition could be used.)


The parity can be computed much more quickly, for 
moderately large n, as follows (illustrated for n = 32; the shifts 
can be signed or unsigned): 


   y = x ^ (x >> 1);
   y = y ^ (y >> 2);
   y = y ^ (y >> 4);                    (3)
   y = y ^ (y >> 8);
   y = y ^ (y >> 16);


This executes in ten instructions, as compared to 62 for the first 
method, even if the implied loop is completely unrolled. Again, 
the parity bit is the rightmost bit of y. In fact, with either of 
these, if the shifts are unsigned, then bit i of y gives the parity of 
the bits of x at and to the left of i. Furthermore, because 
exclusive or is its own inverse, $y_i \oplus y_j$ (bits i and j of y) is
the parity of bits i - 1 through j of x, for i ≥ j.
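
The prefix property can be checked with a short sketch. It assumes 32-bit words and unsigned shifts; the names parity_prefix and parity are hypothetical.

```c
#include <stdint.h>

/* With unsigned shifts, bit i of the result is the parity of the
   bits of x at and to the left of position i. */
uint32_t parity_prefix(uint32_t x) {
    uint32_t y;
    y = x ^ (x >> 1);
    y = y ^ (y >> 2);
    y = y ^ (y >> 4);
    y = y ^ (y >> 8);
    y = y ^ (y >> 16);
    return y;
}

/* Parity of the whole word: 1 if odd, 0 if even. */
int parity(uint32_t x) {
    return (int)(parity_prefix(x) & 1u);
}
```

For example, with y = parity_prefix(x), the value ((y >> 8) ^ (y >> 4)) & 1 is the parity of bits 7 through 4 of x.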


This is an example of the “parallel prefix,” or “scan”
operation, which has applications in parallel computing [KRS;
HS]. Given a sufficient number of processors, it can convert
certain seemingly serial processes from O(n) to O(log2 n) time.
For example, if you have an array of words and you wish to 
compute the exclusive or scan operation on the entire array of 
bits, you can first use (3) on the entire array, and then continue 
with shifts of 32 bits, 64 bits, and so on, doing exclusive or's on 
the words of the array. This takes more elementary (word 
length) exclusive or operations than a simple left-to-right process, 
and hence it is not a good idea for a uniprocessor. But on a 
parallel computer with a sufficient number of processors, it can 
do the job in O(log2 n) rather than O(n) time (where n is the
number of words in the array). 


A direct application of (3) is the conversion of a Gray coded 
integer to binary (see page 312). 


If the code (3) is changed to use left shifts, the parity of the 
whole word x winds up in the leftmost bit position, and bit i of y 
gives the parity of the bits of x at and to the right of position i. 
This is called the "parallel suffix" operation, because each bit is 
a function of itself and the bits that follow it. 


If rotate shifts are used, the result is a word of all 1’s if the 
parity of x is odd, and of all 0’s if even. 


The five assignments in (3) can be done in any order 
(provided variable x is used in the first one). If they are done in 
reverse order, and if you are interested only in getting the parity 
in the low-order bit of y, then the last two lines: 


   y = y ^ (y >> 2);
   y = y ^ (y >> 1);


can be replaced with [Huef]


   y = 0x6996 >> (y & 0xF);


This is an “in-register table lookup” operation. On the basic RISC
it saves one instruction, or two if the load of the constant is not 
counted. The low-order bit of y has the original word's parity, 
but the other bits of y do not contain anything useful. 
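
Put together, the reverse-order version with the in-register lookup might look like the following sketch. It assumes 32-bit words; the name parity_6996 is hypothetical. Bit k of the constant 0x6996 is the parity of the 4-bit value k.

```c
#include <stdint.h>

/* Parity of x (1 if odd, 0 if even), folding 32 bits down to 4
   and finishing with an in-register table lookup. */
int parity_6996(uint32_t x) {
    uint32_t y;
    y = x ^ (x >> 16);
    y = y ^ (y >> 8);
    y = y ^ (y >> 4);
    y = 0x6996u >> (y & 0xFu);    /* Bit k of 0x6996 is the parity of k. */
    return (int)(y & 1u);         /* Only the low-order bit is meaningful. */
}
```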


The following method executes in nine instructions and
computes the parity of x as the integer 0 or 1 (the shifts are
unsigned).


   x = x ^ (x >> 1);
   x = (x ^ (x >> 2)) & 0x11111111;
   x = x*0x11111111;
   p = (x >> 28) & 1;


After the second statement above, each hex digit of x is 0 or 1,
according to the parity of the bits in that hex digit. The multiply
adds these digits, putting the sum in the high-order hex digit.
There can be no carry out of any hex column during the add part
of the multiply, because the maximum sum of a column is 8.
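
As a function, the multiply-based method can be sketched like this, assuming a 32-bit word so that the product wraps modulo 2^32; the name parity_mul is hypothetical.

```c
#include <stdint.h>

/* Parity of x as the integer 0 or 1, using a multiply to add
   the per-nibble parities. */
int parity_mul(uint32_t x) {
    x = x ^ (x >> 1);
    x = (x ^ (x >> 2)) & 0x11111111u;   /* Each hex digit is now 0 or 1. */
    x = x * 0x11111111u;                /* Sum lands in the high hex digit. */
    return (int)((x >> 28) & 1u);
}
```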


The multiply and shift could be replaced by an instruction to 
compute the remainder after dividing x by 15, giving a (slow) 
solution in eight instructions, if the machine has remainder 
immediate. 


On a 64-bit machine, the above code employing 
multiplication gives the correct result after making the obvious 
changes (expand the hex constants to 16 nibbles, each with 
value 1, and change the final shift amount from 28 to 60). In 
this case, the maximum sum in any 4-bit column of the partial 
products, other than the most significant column, is 15, so again 
no overflow occurs that affects the result in the most significant 
column. On the other hand, the variation that computes the 
remainder upon division by 15 does not work on a 64-bit 
machine, because the remainder is the sum of the nibbles 
modulo 15, and the sum may be as high as 16. 


Adding a Parity Bit to a 7-Bit Quantity 


Item 167 in [HAK] contains a novel expression for putting even 
parity on a 7-bit quantity that is right-adjusted and isolated in a 
register. By this we mean to set the bit to the left of the seven 
bits, to make an 8-bit quantity with even parity. Their code is for 
a 36-bit machine, but it works on a 32-bit machine as well. 


modu((x * 0x10204081) & 0x888888FF, 1920) 


Here, modu(a, b) denotes the remainder of a upon division by b,
with the arguments and result interpreted as unsigned integers,
“*” denotes multiplication modulo $2^{32}$, and the constant 1920 is
15 · $2^7$. Actually, this computes the sum of the bits of x, and
places the sum just to the left of the seven bits comprising x. For 
example, the expression maps 0x0000007F to 0x000003FF, and 
0x00000055 to 0x00000255. 
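
The expression translates directly to C. The sketch below assumes a 32-bit unsigned type, so that the multiplication wraps modulo 2^32 and % is the unsigned remainder; the name add_bit_sum7 is hypothetical.

```c
#include <stdint.h>

/* For x in 0..127, returns x with the sum of its bits placed just to
   the left of the seven bits; bit 7 of the result is then the
   even-parity bit for x. */
uint32_t add_bit_sum7(uint32_t x) {
    return ((x * 0x10204081u) & 0x888888FFu) % 1920u;
}
```

Masking the result with 0xFF keeps only the 7-bit quantity and its parity bit.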


Another ingenious formula from [HAK] is the following, 
which puts odd parity on a 7-bit integer: 


modu((x * 0x00204081) | 0x3DB6DB00, 1152),


where 1152 = 9 · $2^7$. To understand this, it helps to know that
the powers of 8 are ±1 modulo 9. If the 0x3DB6DB00 is
changed to 0xBDB6DB00, this formula applies even parity.


These methods are not practical on today's machines, because 
memory is cheap but division is still slow. Most programmers 
would compute these functions with a simple table lookup. 


Applications 


The parity operation is widely used to calculate a check bit to 
append to data. It is also useful in multiplying bit matrices in 
GF(2) (in which the add operation is exclusive or). 


5-3 Counting Leading 0’s 


There are several simple ways to count leading 0’s with a binary 
search technique. Below is a model that has several variations. It 
executes in 20 to 29 instructions on the basic RISC. The 
comparisons are “logical” (unsigned integers). 


   if (x == 0) return(32);
   n = 0;
   if (x <= 0x0000FFFF) {n = n +16; x = x <<16;}
   if (x <= 0x00FFFFFF) {n = n + 8; x = x << 8;}
   if (x <= 0x0FFFFFFF) {n = n + 4; x = x << 4;}
   if (x <= 0x3FFFFFFF) {n = n + 2; x = x << 2;}
   if (x <= 0x7FFFFFFF) {n = n + 1;}
   return n;


One variation is to replace the comparisons with and’s: 


   if ((x & 0xFFFF0000) == 0) {n = n +16; x = x <<16;}
   if ((x & 0xFF000000) == 0) {n = n + 8; x = x << 8;}


Another variation, which avoids large immediate values, is to 
use shift right instructions. 


The last if statement is simply adding 1 to n if the high- 
order bit of x is 0, so an alternative, which saves a branch 
instruction, is: 


   n = n + 1 - (x >> 31);


The “+ 1” in this assignment can be omitted if n is initialized to 
1 rather than to 0. These observations lead to the algorithm (12 
to 20 instructions on the basic RISC) shown in Figure 5-11. A 
further improvement is possible for the case in which x begins 
with a 1-bit: change the first line to 


   if ((int)x <= 0) return (~x >> 26) & 32;


int nlz(unsigned x) {
   int n;

   if (x == 0) return(32);
   n = 1;
   if ((x >> 16) == 0) {n = n +16; x = x <<16;}
   if ((x >> 24) == 0) {n = n + 8; x = x << 8;}
   if ((x >> 28) == 0) {n = n + 4; x = x << 4;}
   if ((x >> 30) == 0) {n = n + 2; x = x << 2;}
   n = n - (x >> 31);
   return n;
}


FIGURE 5-11. Number of leading zeros, binary search. 
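
Figure 5-11 with the improved first line substituted can be sketched as follows. It assumes 32-bit words and the conventional two's-complement behavior of the cast to a signed type (implementation-defined in C); the name nlz_fast1 is hypothetical.

```c
#include <stdint.h>

/* Number of leading zeros, binary search, with an early exit for
   x == 0 (returns 32) and for x with its high-order bit set (returns 0). */
int nlz_fast1(uint32_t x) {
    int n;

    /* Assumes two's-complement conversion: true when x is 0 or has
       its sign bit set. ~x >> 26 has bit 5 set only when x is 0. */
    if ((int32_t)x <= 0) return (int)((~x >> 26) & 32u);
    n = 1;
    if ((x >> 16) == 0) {n = n + 16; x = x << 16;}
    if ((x >> 24) == 0) {n = n + 8;  x = x << 8;}
    if ((x >> 28) == 0) {n = n + 4;  x = x << 4;}
    if ((x >> 30) == 0) {n = n + 2;  x = x << 2;}
    n = n - (int)(x >> 31);
    return n;
}
```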


Figure 5-12 illustrates a sort of reversal of the above. It 
requires fewer operations the more leading 0’s there are, and
avoids large immediate values and large shift amounts. It 
executes in 12 to 20 instructions on the basic RISC. 


int nlz(unsigned x) {
   unsigned y;
   int n;

   n = 32;
   y = x >>16;  if (y != 0) {n = n -16;  x = y;}
   y = x >> 8;  if (y != 0) {n = n - 8;  x = y;}
   y = x >> 4;  if (y != 0) {n = n - 4;  x = y;}
   y = x >> 2;  if (y != 0) {n = n - 2;  x = y;}
   y = x >> 1;  if (y != 0) return n - 2;
   return n - x;
}


FIGURE 5-12. Number of leading zeros, binary search, 
counting down. 


This algorithm is amenable to a "table assist": the last four 
executable lines can be replaced by 


   static char table[256] = {0,1,2,2,3,3,3,3,4,4,...,8};
   return n - table[x];


Many algorithms can be aided by table lookup, but this will not 
often be mentioned here. 
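
A complete table-assisted version might look like the sketch below. It assumes 32-bit words; the names len_table, init_len_table, and nlz_table are hypothetical, and the table is filled at startup rather than written out, using the fact that entry i is the number of bits needed to represent i.

```c
#include <stdint.h>

static unsigned char len_table[256];   /* 0,1,2,2,3,3,3,3,4,4,...,8 */

/* Entry i is the bit length of i: len(i) = 1 + len(i >> 1), len(0) = 0. */
void init_len_table(void) {
    int i;
    len_table[0] = 0;
    for (i = 1; i < 256; i++)
        len_table[i] = (unsigned char)(1 + len_table[i >> 1]);
}

/* Counting-down binary search for the first two steps, then a table
   lookup on the remaining 8-bit value.
   init_len_table() must have been called first. */
int nlz_table(uint32_t x) {
    uint32_t y;
    int n = 32;

    y = x >> 16;  if (y != 0) {n = n - 16;  x = y;}
    y = x >> 8;   if (y != 0) {n = n - 8;   x = y;}
    return n - len_table[x];
}
```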


For compactness, this and the preceding algorithms in this 
section can be coded as loops. For example, the algorithm of 
Figure 5-12 becomes the algorithm shown in Figure 5-13. This 
executes in 23 to 33 basic RISC instructions, ten of which are 
conditional branches. 


int nlz(unsigned x) {
   unsigned y;
   int n, c;

   n = 32;
   c = 16;
   do {
      y = x >> c;  if (y != 0) {n = n - c;  x = y;}
      c = c >> 1;
   } while (c != 0);
   return n - x;
}


FIGURE 5-13. Number of leading zeros, binary search, coded 
as a loop. 


One can, of course, simply shift left one place at a time, 
counting, until the sign bit is on; or shift right one place at a 
time until the word is all 0. These algorithms are compact and 
work well if the number of leading 0’s is expected to be small or 
large, respectively. One can combine the methods, as shown in 
Figure 5-14. We mention this because the technique of merging 
two algorithms and choosing the result of whichever one stops 
first is more generally applicable. It leads to code that runs fast 
on superscalar machines, because of the proximity of 
independent instructions. (These machines can execute two or 
more instructions simultaneously, provided they are
independent.) 


int nlz(int x) {
   int y, n;

   n = 0;
   y = x;
L: if (x < 0) return n;
   if (y == 0) return 32 - n;
   n = n + 1;
   x = x << 1;
   y = y >> 1;
   goto L;
}


FIGURE 5-14. Number of leading zeros, working both ends at 
the same time. 


On the basic RISC, this executes in min(3 + 6nlz(x), 5 +
6(32 - nlz(x))) instructions, or 99 worst case. One can imagine a
superscalar machine executing the entire loop body in one cycle 
if the comparison results are obtained as a by-product of the 
shifts, or in two cycles otherwise, plus the branch overhead. 


It is straightforward to convert either of the algorithms of
Figure 5-11 or Figure 5-12 to a branch-free counterpart. Figure
5-15 shows a version that does the job in 28 basic RISC
instructions.


int nlz(unsigned x) {
   int y, m, n;

   y = -(x >> 16);      // If left half of x is 0,
   m = (y >> 16) & 16;  // set n = 16.  If left half
   n = 16 - m;          // is nonzero, set n = 0 and
   x = x >> m;          // shift x right 16.
                        // Now x is of the form 0000xxxx.
   y = x - 0x100;       // If positions 8-15 are 0,
   m = (y >> 16) & 8;   // add 8 to n and shift x left 8.
   n = n + m;
   x = x << m;

   y = x - 0x1000;      // If positions 12-15 are 0,
   m = (y >> 16) & 4;   // add 4 to n and shift x left 4.
   n = n + m;
   x = x << m;

   y = x - 0x4000;      // If positions 14-15 are 0,
   m = (y >> 16) & 2;   // add 2 to n and shift x left 2.
   n = n + m;
   x = x << m;

   y = x >> 14;         // Set y = 0, 1, 2, or 3.
   m = y & ~(y >> 1);   // Set m = 0, 1, 2, or 2 resp.
   return n + 2 - m;
}


FIGURE 5-15. Number of leading zeros, branch-free binary 
search. 


If your machine has the population count instruction, a good 
way to compute the number of leading zeros function is given in 
Figure 5-16. The five assignments to x can be reversed, or, in 
fact, done in any order. This is branch-free and takes 11 
instructions. Even if population count is not available, this 
algorithm may be useful. Using the 21-instruction code for 
counting 1-bits given in Figure 5-2 on page 82, it executes in 32 
branch-free basic RISC instructions. 


int nlz(unsigned x) {
   int pop(unsigned x);

   x = x | (x >> 1);
   x = x | (x >> 2);
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >>16);
   return pop(~x);
}


FIGURE 5-16. Number of leading zeros, right-propagate and 
count 1-bits. 


Robert Harley [Harley] devised an algorithm for nlz(x) that is
very similar to Seal's algorithm for ntz(x) (see Figure 5-25 on
page 111). Harley's method propagates the most significant 1-bit
to the right using shift's and or's, and multiplies modulo $2^{32}$ by a
special constant, producing a product whose high-order six bits
uniquely identify the number of leading 0’s in x. It then does a
shift right and a table lookup (indexed load) to translate the
six-bit identifier to the actual number of leading 0’s. As shown in
Figure 5-17, it consists of 14 instructions, including a multiply,
plus an indexed load. Table entries shown as u are unused.


int nlz(unsigned x) {

   static char table[64] =
     {32,31, u,16, u,30, 3, u,  15, u, u, u,29,10, 2, u,
       u, u,12,14,21, u,19, u,   u,28, u,25, u, 9, 1, u,
      17, u, 4, u, u, u,11, u,  13,22,20, u,26, u, u,18,
       5, u, u,23, u,27, u, 6,   u,24, 7, u, 8, u, 0, u};

   x = x | (x >> 1);    // Propagate leftmost
   x = x | (x >> 2);    // 1-bit to the right.
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >>16);
   x = x*0x06EB14F9;    // Multiplier is 7*255**3.
   return table[x >> 26];
}


FIGURE 5-17. Number of leading zeros, Harley's algorithm. 


The multiplier is $7 \cdot 255^3$, so the multiplication can be done as
shown below. In this form, the function consists of 19 
elementary instructions, plus an indexed load. 


x = (x << 3) - x;    // Multiply by 7.
x = (x << 8) - x;    // Multiply by 255.
x = (x << 8) - x;    // Again.
x = (x << 8) - x;    // Again.


There are many multipliers that have the desired uniqueness
property and whose factors are all of the form $2^k \pm 1$. The
smallest is 0x045BCED1 = 17 · 65 · 129 · 513. There are no such
multipliers consisting of three factors if the table size is 64 or
128 entries. If the table size is 256 entries, however, there are a
number of such multipliers. The smallest is 0x01033CBF =
65 · 255 · 1025 (using this would save two instructions at the
expense of a larger table).


Julius Goryavsky [Gor] has found several variations of 
Harley's algorithm that reduce the table size at the expense of a 
few instructions, or have improved parallelism, or have other 
desirable properties. One, shown in Figure 5-18, is a clear 
winner if the multiplication is done with shifts and adds. The 
code changes only the table and the lines that contain the shift 
right of 16 and the following multiply in Figure 5-17. If the 
machine has and not, this saves two instructions because the 
multiplier can be factored as 511 · 2047 · 16383 (mod $2^{32}$), which
can be done in six elementary instructions rather than eight. If 
the machine does not have and not, it saves one instruction. 


   static char table[64] =
     {32,20,19, u, u,18, u, 7,  10,17, u, u,14, u, 6, u,
       u, 9, u,16, u, u, 1,26,   u,13, u, u,24, 5, u, u,
       u,21, u, 8,11, u,15, u,   u, u, u, 2,27, 0,25, u,
      22, u,12, u, u, 3,28, u,  23, u, 4,29, u, u,30,31};

   x = x & ~(x >> 16);
   x = x*0xFD7049FF;


FIGURE 5-18. Number of leading zeros, Goryavsky's variation 
of Harley's algorithm. 
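
For reference, the pieces of Goryavsky's variation can be assembled into one function. This is a sketch, not the author's listing: the name nlz_goryavsky and the marker value U (99) for unused entries are assumptions, and unsigned is taken to be 32 bits.

```c
#define U 99   /* Arbitrary marker for unused table entries (assumption). */

int nlz_goryavsky(unsigned x) {
   static const char table[64] =
     {32,20,19, U, U,18, U, 7,  10,17, U, U,14, U, 6, U,
       U, 9, U,16, U, U, 1,26,   U,13, U, U,24, 5, U, U,
       U,21, U, 8,11, U,15, U,   U, U, U, 2,27, 0,25, U,
      22, U,12, U, U, 3,28, U,  23, U, 4,29, U, U,30,31};

   x = x | (x >> 1);        /* Propagate leftmost 1-bit */
   x = x | (x >> 2);        /* 15 positions to the right. */
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x & ~(x >> 16);      /* Keep a run of at most 16 1-bits. */
   x = x*0xFD7049FFu;       /* Product is taken mod 2**32. */
   return table[x >> 26];   /* High-order six bits index the table. */
}
```

After the and-not step, x is a run of at most sixteen 1-bits starting at the original leading 1-bit, so only 33 distinct values reach the multiply.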


Floating-Point Methods 


The floating-point post-normalization facilities can be used to 
count leading zeros. It works out quite well with IEEE-format 
floating-point numbers. The idea is to convert the given 
unsigned integer to double-precision floating-point, extract the 
exponent, and subtract it from a constant. Figure 5-19 illustrates 
a complete procedure for this. 


int nlz(unsigned k) {
   union {
      unsigned asInt[2];
      double asDouble;
   };
   int n;

   asDouble = (double)k + 0.5;
   n = 1054 - (asInt[LE] >> 20);
   return n;
}


FIGURE 5-19. Number of leading zeros, using IEEE floating- 
point. 


The code uses the C++ “anonymous union” to overlay an
integer with a double-precision floating-point quantity. Variable 
LE must be 1 for execution on a little-endian machine, and 0 for 
big-endian. The addition of 0.5, or some other small number, is 
necessary for the method to work when x = 0. 


We will not attempt to assess the execution time of this code, 
because machines differ so much in their floating-point 
capabilities. For example, many machines have their floating- 
point registers separate from the integer registers, and on such 
machines data transfers through memory may be required to 
convert an integer to floating-point and then move the result to
an integer register.


The code of Figure 5-19 is not valid C or C++ according to
the ANSI standard, because it refers to the same memory 
locations as two different types. Thus, one cannot be sure it will 
work on a particular machine and compiler. It does work with 
IBM’s XLC compiler on AIX, and with the GCC compiler on AIX 
and on Windows 2000 and XP, at all optimization levels (as of 
this writing, anyway). If the code is altered to do the overlay 
defining with something like 


xx = (double)k + 0.5;
n = 1054 - (*((unsigned *)&xx + LE) >> 20);


it does not work on these systems with optimization turned on. 
This code, incidentally, violates a second ANSI standard, 
namely, that pointer arithmetic can be performed only on 
pointers to array elements [Cohen]. The failure, however, is due 
to the first violation, involving overlay defining. 
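
One way to avoid the aliasing problem altogether is to copy the bytes of the double into an integer with memcpy, which is well defined in both C and C++. The following is a minimal sketch, not the author's code: the name nlz_double is hypothetical, and it assumes IEEE 754 double precision and that double and uint64_t use the same byte order (so no LE variable is needed).

```c
#include <stdint.h>
#include <string.h>

int nlz_double(uint32_t k) {
   double d = (double)k + 0.5;       /* The 0.5 handles k = 0. */
   uint64_t bits;

   memcpy(&bits, &d, sizeof bits);   /* Copy the representation. */
   /* d is positive, so bits >> 52 is the biased exponent alone. */
   return 1054 - (int)(bits >> 52);
}
```

Compilers commonly turn this memcpy into a plain register move, so the portable form need not cost more than the union overlay.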


In spite of the flakiness of this code,2 three variations are 
given below. 


asDouble = (double)k;
n = 1054 - (asInt[LE] >> 20);
n = (n & 31) + (n >> 9);

k = k & ~(k >> 1);
asFloat = (float)k + 0.5f;
n = 158 - (asInt >> 23);

k = k & ~(k >> 1);
asFloat = (float)k;
n = 158 - (asInt >> 23);
n = (n & 31) + (n >> 6);


In the first variation, the problem with x = 0 is fixed not by
a floating-point addition of 0.5, but by integer arithmetic on the
result n (which would be 1054, or 0x41E, if the correction were
not done).


The next two variations use single-precision floating-point,
with the “anonymous union” changed in an obvious way. Here
there is a new problem: Rounding can throw off the result when
the rounding mode is either round to nearest (almost universally
used) or round toward +∞. For round to nearest mode, the
rounding problem occurs for x in the ranges hexadecimal
FFFFFF80 to FFFFFFFF, 7FFFFFC0 to 7FFFFFFF, 3FFFFFE0 to
3FFFFFFF, and so on. In rounding, an add of 1 carries all the
way to the left, changing the position of the most significant
1-bit. The correction steps used above clear the bit to the right of
the most significant 1-bit, blocking the carry. If x is a 64-bit
quantity, this correction is also needed for the code of Figure
5-19 and for the first of the three variations given above.


The GNU C/C++ compiler has a unique feature that allows
coding any of these schemes as a macro, giving in-line code for 
the function references [Stall]. This feature allows statements, 
including declarations, to be inserted in code where an 
expression is called for. The sequence of statements would 
usually end with an expression, which is taken to be the value of 
the construction. Such a macro definition is shown below, for 
the first single-precision variation. (In C, it is customary to use 
uppercase for macro names.) 


#define NLZ(kp) \
   ({union {unsigned _asInt; float _asFloat;}; \
     unsigned _k = (kp), _kk = _k & ~(_k >> 1); \
     _asFloat = (float)_kk + 0.5f; \
     158 - (_asInt >> 23);})


The underscores are used to avoid name conflicts with 
parameter kp; presumably, user-defined names do not begin 
with underscores. 


Comparing the Number of Leading Zeros of Two Words 


There is a simple way to determine which of two words x and y 
has the larger number of leading zeros [Knu5] without actually 
computing nlz(x) or nlz(y). The methods are shown in the 
equivalences below. The three relations not shown are, of 
course, obtained by complementing the sense of the comparison 
on the right. 


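$$\begin{aligned}
\operatorname{nlz}(x) = \operatorname{nlz}(y) &\quad\text{if and only if}\quad (x \oplus y) \overset{u}{\le} (x \mathbin{\&} y) \\
\operatorname{nlz}(x) < \operatorname{nlz}(y) &\quad\text{if and only if}\quad (x \mathbin{\&} \lnot y) \overset{u}{>} y \\
\operatorname{nlz}(x) \le \operatorname{nlz}(y) &\quad\text{if and only if}\quad (y \mathbin{\&} \lnot x) \overset{u}{\le} x
\end{aligned}$$

(The comparisons on the right are unsigned.)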
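
In C the three equivalences amount to one or two operations followed by an unsigned comparison. A small sketch (the function names are hypothetical, chosen for illustration):

```c
/* Each function returns 1 if the named relation between nlz(x)
   and nlz(y) holds, else 0.  All comparisons are unsigned. */
int nlz_eq(unsigned x, unsigned y) { return (x ^ y) <= (x & y); }
int nlz_lt(unsigned x, unsigned y) { return (x & ~y) >  y; }
int nlz_le(unsigned x, unsigned y) { return (y & ~x) <= x; }
```

For instance, 5 and 7 share the same leading 1-bit, so the exclusive or (2) loses that bit while the and (5) keeps it, and nlz_eq(5, 7) is true.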


Relation to the Log Function 


The “nlz” function is, essentially, the “integer log base 2”
function. For unsigned x ≠ 0,

$$\lfloor \log_2(x) \rfloor = 31 - \operatorname{nlz}(x), \quad\text{and}$$
$$\lceil \log_2(x) \rceil = 32 - \operatorname{nlz}(x - 1).$$


See also Section 11-4, “Integer Logarithm,” on page 291. 
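
The two log formulas translate directly into C. In this sketch the helper nlz_ref is a deliberately simple reference implementation (an assumption; any nlz from this section would do), and both log functions require x ≠ 0.

```c
/* Simple reference nlz: shift left until the sign bit is on. */
static int nlz_ref(unsigned x) {
   int n = 0;

   if (x == 0) return 32;
   while ((x & 0x80000000u) == 0) {n = n + 1;  x = x << 1;}
   return n;
}

int floor_log2(unsigned x) { return 31 - nlz_ref(x); }      /* x != 0 */
int ceil_log2(unsigned x)  { return 32 - nlz_ref(x - 1); }  /* x != 0 */
```

For x = 1000 the leading 1-bit is at position 9, so the floor is 9 and the ceiling is 10; for an exact power of 2 such as 1024 both give 10.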


Another closely related function is bitsize, the number of bits
required to represent its argument as a signed quantity in
two’s-complement form. We take its definition to be

$$\operatorname{bitsize}(x) = \begin{cases}
1, & x = -1 \text{ or } 0, \\
2, & x = -2 \text{ or } 1, \\
3, & -4 \le x \le -3 \text{ or } 2 \le x \le 3, \\
4, & -8 \le x \le -5 \text{ or } 4 \le x \le 7, \\
\;\vdots \\
32, & -2^{31} \le x \le -2^{30} - 1 \text{ or } 2^{30} \le x \le 2^{31} - 1.
\end{cases}$$

From this definition, bitsize(x) = bitsize(-x - 1). But -x - 1
= ~x, so an algorithm for bitsize is (where the shift is signed)


x = x ^ (x >> 31);      // If (x < 0) x = -x - 1;
return 33 - nlz(x);


An alternative, which is the same function as bitsize(x) 
except it gives the result 0 for x = 0, is 


32 - nlz(x ^ (x << 1))
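
Both forms of bitsize can be written out as complete functions. This is a sketch under assumptions: the names bitsize, bitsize0, and nlz_ref are hypothetical, int is 32 bits, and the complement is done with an explicit test rather than a signed shift, because right-shifting a negative int is implementation-defined in C.

```c
/* Simple reference nlz: shift left until the sign bit is on. */
static int nlz_ref(unsigned x) {
   int n = 0;

   if (x == 0) return 32;
   while ((x & 0x80000000u) == 0) {n = n + 1;  x = x << 1;}
   return n;
}

int bitsize(int x) {
   unsigned u = (unsigned)x;

   if (x < 0) u = ~u;            /* Same effect as x ^ (x >> 31). */
   return 33 - nlz_ref(u);
}

/* Same as bitsize, except that the result for x = 0 is 0. */
int bitsize0(int x) {
   unsigned u = (unsigned)x;

   return 32 - nlz_ref(u ^ (u << 1));
}
```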


Applications 


Two important applications of the number of leading zeros 
function are in simulating floating-point arithmetic operations 
and in various division algorithms (see Figure 9-1 on page 185 
and Figure 9-3 on page 196). The instruction seems to have a 
miscellany of other uses. 


It can be used to get the “x = y” predicate in only three
instructions (see “Comparison Predicates” on page 23), and as an
aid in computing certain elementary functions (see pages 281,
284, 290, and 294).


A novel application is to generate exponentially distributed 
random integers by generating uniformly distributed random 
integers and taking nlz of the result [GLS1]. The result is 0 with
probability 1/2, 1 with probability 1/4, 2 with probability 1/8,
and so on. Another application is as an aid in searching a word
for a consecutive string of 1-bits (or 0-bits) of a certain length, a
process that is used in some disk block allocation algorithms. For
these last two applications, the number of trailing zeros function 
could also be used. 


5-4 Counting Trailing 0’s 


If the number of leading zeros instruction is available, then the
best way to count trailing 0’s is, most likely, to convert it to a
count leading 0’s problem:

$$32 - \operatorname{nlz}(\lnot x \mathbin{\&} (x - 1)).$$


If population count is available, a slightly better method is to
form a mask that identifies the trailing 0’s, and count the 1-bits
in it [Hay2], such as

$$\operatorname{pop}(\lnot x \mathbin{\&} (x - 1)), \quad\text{or}$$
$$32 - \operatorname{pop}(x \mid -x).$$


Variations exist using other expressions for forming a mask 
that identifies the trailing zeros of x, such as those given in 
Section 2-1, “Manipulating Rightmost Bits," on page 11. These 
methods are also reasonable even if the machine has none of the 
bit-counting instructions. Using the algorithm for pop(x) given 
in Figure 5-2 on page 82, the first expression above executes in 
about 3 + 21 = 24 instructions (branch-free). 
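
The two mask-and-count expressions can be sketched in C as follows. The pop routine here is a standard bit-summing population count used as a stand-in (an assumption; a hardware instruction or any other correct pop works equally well), and the names ntz_pop1 and ntz_pop2 are hypothetical.

```c
/* Stand-in population count: sum bits in 2-, 4-, 8-, 16-, and
   32-bit fields. */
static int pop(unsigned x) {
   x = x - ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   x = (x + (x >> 4)) & 0x0F0F0F0F;
   x = x + (x >> 8);
   x = x + (x >> 16);
   return x & 0x0000003F;
}

/* Mask of the trailing 0's, then count its 1-bits. */
int ntz_pop1(unsigned x) { return pop(~x & (x - 1)); }

/* x | -x is 1 from the rightmost 1-bit leftward. */
int ntz_pop2(unsigned x) { return 32 - pop(x | -x); }
```

Both forms give 32 for x = 0 without a special test.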


Figure 5-20 shows an algorithm that does it directly, in 12 to
20 basic RISC instructions (for x ≠ 0).


int ntz(unsigned x) {
   int n;

   if (x == 0) return(32);
   n = 1;
   if ((x & 0x0000FFFF) == 0) {n = n +16; x = x >>16;}
   if ((x & 0x000000FF) == 0) {n = n + 8; x = x >> 8;}
   if ((x & 0x0000000F) == 0) {n = n + 4; x = x >> 4;}
   if ((x & 0x00000003) == 0) {n = n + 2; x = x >> 2;}
   return n - (x & 1);
}


FIGURE 5-20. Number of trailing zeros, binary search. 


The n + 16 can be simplified to 17 if that helps, and if the 
compiler is not smart enough to do that for you (this does not 
affect the number of instructions as we are counting them). 


Figure 5-21 shows a variation that uses smaller immediate
values and simpler operations. It executes in 12 to 21 basic RISC
instructions. Unlike the above procedure, when the number of
trailing 0’s is small, the procedure of Figure 5-21 executes a
larger number of instructions, but also a larger number of
“fall-through” branches.


int ntz(unsigned x) {
   unsigned y;
   int n;

   if (x == 0) return 32;
   n = 31;
   y = x <<16;  if (y != 0) {n = n -16;  x = y;}
   y = x << 8;  if (y != 0) {n = n - 8;  x = y;}
   y = x << 4;  if (y != 0) {n = n - 4;  x = y;}
   y = x << 2;  if (y != 0) {n = n - 2;  x = y;}
   y = x << 1;  if (y != 0) {n = n - 1;}
   return n;
}


FIGURE 5-21. Number of trailing zeros, smaller immediate 
values. 


The line just above the return statement can alternatively be 
coded 


n = n - ((x << 1) >> 31);


which saves a branch, but not an instruction. 


In terms of number of instructions executed, it is hard to beat 
the “search tree" [Aus2]. Figure 5-22 illustrates this procedure 
for an 8-bit argument. This procedure executes in seven 
instructions for all paths except the last two (return 7 or 8), 
which require nine. A 32-bit version would execute in 11 to 13 
instructions. Unfortunately, for large word sizes, the program is 
quite large. The 8-bit version above is 12 lines of executable 
source code and would compile into about 41 instructions. A 32- 
bit version would be 48 lines and about 164 instructions, and a 
64-bit version would be twice that. 


int ntz(char x) {
   if (x & 15) {
      if (x & 3) {
         if (x & 1) return 0;
         else return 1;
      }
      else if (x & 4) return 2;
      else return 3;
   }
   else if (x & 0x30) {
      if (x & 0x10) return 4;
      else return 5;
   }
   else if (x & 0x40) return 6;
   else if (x) return 7;
   else return 8;
}


FIGURE 5-22. Number of trailing zeros, binary search tree. 


If the number of trailing 0’s is expected to be small or large,
then the simple loops shown in Figure 5-23 are quite fast. The
algorithm on the left executes in 5 + 3ntz(x), and that on the 
right in 3 + 3(32 - ntz(x)) basic RISC instructions. 


int ntz(unsigned x) {
   int n;

   x = ~x & (x - 1);
   n = 0;                       // n = 32;
   while (x != 0) {             // while (x != 0) {
      n = n + 1;                //    n = n - 1;
      x = x >> 1;               //    x = x + x;
   }                            // }
   return n;                    // return n;
}


FIGURE 5-23. Number of trailing zeros, simple counting loops. 


Dean Gaudet [Gaud] devised an algorithm that is interesting 
because with the right instructions it is branch-free, load-free 
(does not use table lookup), and has lots of parallelism. It is 
shown in Figure 5-24. 


int ntz(unsigned x) {
   unsigned y, bz, b4, b3, b2, b1, b0;

   y = x & -x;               // Isolate rightmost 1-bit.
   bz = y ? 0 : 1;           // 1 if y = 0.
   b4 = (y & 0x0000FFFF) ? 0 : 16;
   b3 = (y & 0x00FF00FF) ? 0 :  8;
   b2 = (y & 0x0F0F0F0F) ? 0 :  4;
   b1 = (y & 0x33333333) ? 0 :  2;
   b0 = (y & 0x55555555) ? 0 :  1;
   return bz + b4 + b3 + b2 + b1 + b0;
}


FIGURE 5-24. Number of trailing zeros, Gaudet's algorithm. 


As shown, the code uses the C “conditional expression” in six 
places. This construct has the form a?b:c. Its value is b if a is 
true (nonzero), and c if a is false (zero). Although a conditional
expression must, in general, be compiled into compares and
branches, for the simple cases in Figure 5-24 branching can be 
avoided if the machine has a compare for equality to zero 
instruction that sets a target register to 1 if the operand is 0, and 
to 0 if the operand is nonzero. Branching can also be avoided by 
using conditional move instructions. Using compare, the 
assignment to b3 can be compiled into five instructions on the 
basic RISC: two to generate the hex constant, an and, the 
compare, and a shift left of 3. (The first, second, and last 
conditional expressions require one, three, and four instructions, 
respectively.) 


The code can be compiled into a total of 30 instructions. All 
six lines with the conditional expressions can run in parallel. On 
a machine with a sufficient degree of parallelism, it executes in 
ten cycles. Present machines don't have that much parallelism, 
so as a practical matter it might help to change the first two uses 
of y in the program to x. This permits the first three executable 
statements to run in parallel. 


David Seal [Seal2] devised an algorithm for computing ntz(x)
that is based on the idea of compressing the $2^{32}$ possible values
of x to a small dense set of integers and doing a table lookup. He
uses the expression x & -x to reduce the number of possible
values to a small number. The value of this expression is a word
that contains a single 1-bit at the position of the least significant
1-bit in x, or is 0 if x = 0. Thus, x & -x has only 33 possible
values. But they are not dense; they range from 0 to $2^{31}$.


To produce a dense set of 33 integers that uniquely identify
the 33 values of x & -x, Seal found a certain constant which,
when multiplied by x & -x, produces the identifying value in the
high-order six bits of the low-order half of the product of the
constant and x & -x. Since x & -x is an integral power of 2 or is
0, the multiplication amounts to a left shift of the constant, or it
is a multiplication by 0. Using only the high-order five bits is not
sufficient, because 33 distinct values are needed.


The code is shown in Figure 5-25, where table entries shown 
as u are unused. 


int ntz(unsigned x) {

   static char table[64] =
     {32, 0, 1,12, 2, 6, u,13,   3, u, 7, u, u, u, u,14,
      10, 4, u, u, 8, u, u,25,   u, u, u, u, u,21,27,15,
      31,11, 5, u, u, u, u, u,   9, u, u,24, u, u,20,26,
      30, u, u, u, u,23, u,19,  29, u,22,18,28,17,16, u};

   x = (x & -x)*0x0450FBAF;
   return table[x >> 26];
}


FIGURE 5-25. Number of trailing zeros, Seal's algorithm. 


As an example, if x is an odd multiple of 16, then x & -x =
16, so the multiplication is simply a left shift of four positions. 
The high-order six bits of the low-order half of the product are 
then binary 010001, or 17 decimal. The table translates 17 to 4, 
which is the correct number of trailing 0’s for an odd multiple of 
16. 


There are thousands of constants that have the necessary
uniqueness property. The smallest is 0x0431472F, and the
largest is 0xFDE75C6D. Seal chose a constant for which the
multiplication can be done with a small number of shifts and
adds. Since 0x0450FBAF = 17 · 65 · 65535, the multiplication can
be done as follows:


x = (x << 4) + x;    // x = x*17.
x = (x << 6) + x;    // x = x*65.
x = (x << 16) - x;   // x = x*65535.


With this substitution, the code of Figure 5-25 consists of nine 
elementary instructions, plus an indexed load. Seal was 
interested in the ARM instruction set, which can do a shift and 
add in one instruction. On that architecture, the code is six 
instructions, including the indexed load. 


To make the multiplication even easier to do with shifts and
adds, one might hope to find a constant of the form
$(2^{k_1} \pm 1)(2^{k_2} \pm 1)$ that has the necessary uniqueness property. For a
table size of 64, there are no such integers, and there is only one
other suitable integer that is a product of three such factors:
0x08A1FBAF = 17 · 65 · 131071. Using a table size of 128 or
256 does not help. However, for a table size of 512 there are
four suitable integers of the form $(2^{k_1} \pm 1)(2^{k_2} \pm 1)$; the
smallest is 0x0080FF7F = 129 · 65535. We leave it to the reader
to determine the table associated with this constant.
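
One way to determine such a table is to let the machine fill it in: multiply each power of 2 by the constant, take the high-order nine bits, and record the bit position there. The sketch below does this for 0x0080FF7F and also checks the uniqueness property. The names table512, build_table512, and ntz_512 are hypothetical, and -1 is an arbitrary marker for unused entries.

```c
#include <string.h>

#define UNUSED (-1)     /* Marker for unused entries (assumption). */

static signed char table512[512];

/* Fills table512 so that table512[((x & -x)*0x0080FF7F) >> 23]
   equals ntz(x).  Returns 1 if all 33 indexes are distinct. */
int build_table512(void) {
   unsigned idx;
   int k;

   memset(table512, UNUSED, sizeof table512);
   table512[0] = 32;                  /* x = 0 gives a 0 product. */
   for (k = 0; k < 32; k++) {
      idx = ((1u << k)*0x0080FF7Fu) >> 23;   /* High nine bits. */
      if (table512[idx] != UNUSED) return 0; /* Collision. */
      table512[idx] = (signed char)k;
   }
   return 1;
}

int ntz_512(unsigned x) {
   return table512[((x & -x)*0x0080FF7Fu) >> 23];
}
```

Call build_table512 once before using ntz_512. The same loop, with the shift amount and table size changed, can be used to test other candidate constants.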


There is a variation of Seal's method that is based on de
Bruijn cycles [LPR]. These are cyclic sequences over a given
alphabet that contain as a subsequence every sequence of the
letters of the alphabet of a given length exactly once. For
example, a cycle that contains as a subsequence every sequence
of {a, b, c} of length 2 is aabacbbcc. Notice that the sequence ca
wraps around from the end to the beginning. If the alphabet size
is k and the length is n, there are $k^n$ sequences. For a cycle to
contain all of these, it must be of length at least $k^n$, which would
be its length if a different sequence started at each position. It
can be shown that there is always a cycle of this minimum
possible length that contains all $k^n$ sequences.


For our purposes, the alphabet is {0, 1}, and for dealing with
32-bit words, we are interested in a cycle that contains all 32 
sequences 00000, 00001, 00010, ..., 11111. Given such a cycle 
that begins with at least four 0’s, we can compute ntz(x) by first 
reducing x to a word that contains a single bit at the position of 
the least significant bit of x, as in Seal's algorithm. Then, by 
multiplication, we can select a 5-bit field of the de Bruijn cycle, 
which will be a unique value for each multiplier. This can be 
mapped to give the number of trailing 0’s by a table lookup. The 
algorithm follows. The de Bruijn cycle used is 


0000 0100 1101 0111 0110 0101 0001 1111.


It is in effect a cycle, because in use it has trailing 0’s beyond the 
32 bits shown above, which is effectively the same as wrapping 
to the beginning. 
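
The defining property is easy to check mechanically: shifting the 32-bit word left by k (which brings in the trailing 0's) and taking the high-order five bits must give 32 different values for k = 0 to 31. A sketch, with the hypothetical name is_debruijn32:

```c
/* Returns 1 if the 5-bit fields (cycle << k) >> 27, k = 0..31,
   are all distinct, else 0.  Assumes 32-bit unsigned. */
int is_debruijn32(unsigned cycle) {
   unsigned seen = 0;       /* Bit i set: field value i was seen. */
   unsigned field;
   int k;

   for (k = 0; k < 32; k++) {
      field = (cycle << k) >> 27;
      if (seen & (1u << field)) return 0;
      seen = seen | (1u << field);
   }
   return 1;
}
```

The cycle shown above is 0x04D7651F in hexadecimal and passes this check; clearing its low-order bit, for example, makes the field 00000 occur twice.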


There are 33 possible values of ntz(x) and only 32 five-bit 
subsequences in the de Bruijn cycle. Therefore, two words with 
different values of ntz(x) must map to the same number by the 
table lookup. The words that conflict are zero and words that 
end with a 1-bit. To resolve this, the code has a test for 0 and 
returns 32 in that case. A branch-free way to resolve it, useful if 
your computer has predicate comparison instructions, is to 
change the last statement to 


return table[x >> 27] + 32*(x == 0);


To compare the two algorithms, Seal's does not require the
test for zero and it allows the alternative of doing the
multiplication with six elementary instructions. The de Bruijn
algorithm uses a smaller table. The de Bruijn cycle used in
Figure 5-26, discovered by Danny Dubé [Dubé], is a good one
because multiplication by it can be done with eight elementary
instructions. The constant is 0x04D7651F = (2047 · 5 · 256 +
1) · 31, from which one can see the shifts, adds, and subtracts
that do the job.


int ntz(unsigned x) {

   static char table[32] =
     { 0, 1, 2,24, 3,19, 6,25,  22, 4,20,10,16, 7,12,26,
      31,23,18, 5,21, 9,15,11,  30,17, 8,14,29,13,28,27};

   if (x == 0) return 32;
   x = (x & -x)*0x04D7651F;
   return table[x >> 27];
}


FIGURE 5-26. Number of trailing zeros using a de Bruijn cycle. 


John Reiser [Reiser] observed that there is another way to
map the 33 values of the factor x & -x in Seal's algorithm to a
dense set of unique integers: divide and use the remainder. The 
smallest divisor that has the necessary uniqueness property is 
37. The resulting code is shown in Figure 5-27, where table 
entries shown as u are unused. 


int ntz(unsigned x) {

   static char table[37] = {32,  0,  1, 26,  2, 23, 27,
                 u,  3, 16, 24, 30, 28, 11,  u, 13,  4,
                 7, 17,  u, 25, 22, 31, 15, 29, 10, 12,
                 6,  u, 21, 14,  9,  5, 20,  8, 19, 18};

   x = (x & -x)%37;
   return table[x];
}


FIGURE 5-27. Number of trailing zeros, Reiser's algorithm. 


It is interesting to note that if the numbers x are uniformly
distributed, then the average number of trailing 0’s is, very
nearly, 1.0. To see this, sum the products $p_i n_i$, where $p_i$ is the
probability that there are exactly $n_i$ trailing 0’s. That is,

$$S = \frac{1}{2}\cdot 0 + \frac{1}{4}\cdot 1 + \frac{1}{8}\cdot 2 + \frac{1}{16}\cdot 3 + \frac{1}{32}\cdot 4 + \frac{1}{64}\cdot 5 + \cdots = \sum_{n=0}^{\infty} \frac{n}{2^{n+1}}.$$


To evaluate this sum, consider the following array:


1/4   1/8   1/16   1/32   1/64   ...
      1/8   1/16   1/32   1/64   ...
            1/16   1/32   1/64   ...
                   1/32   1/64   ...
                          1/64   ...
                                 ...


The sum of each column is a term of the series for S. Hence S is
the sum of all the numbers in the array. The sums of the rows are


1/4 + 1/8 + 1/16 + 1/32 + ... = 1/2
1/8 + 1/16 + 1/32 + 1/64 + ... = 1/4
1/16 + 1/32 + 1/64 + 1/128 + ... = 1/8
...


and the sum of these is 1/2 + 1/4 + 1/8 + ... = 1. The
absolute convergence of the original series justifies the
rearrangement.
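
For 32-bit words the average can also be computed exactly, since exactly 2**(31 - n) words have n trailing 0's for n = 0 to 31, and one word (zero) has 32. The following sketch sums the corresponding terms; the name average_ntz32 is hypothetical, and the treatment of x = 0 as contributing 32 follows the convention ntz(0) = 32.

```c
/* Average of ntz(x) over all 2**32 words x, with ntz(0) = 32. */
double average_ntz32(void) {
   double sum = 0.0;
   double p = 0.5;          /* p = 1/2**(n+1), the probability */
   int n;                   /* of exactly n trailing 0's.      */

   for (n = 0; n < 32; n++) {
      sum = sum + n*p;
      p = p*0.5;
   }
   sum = sum + 32.0*(2.0*p);   /* x = 0 has probability 1/2**32. */
   return sum;
}
```

Working the finite sum by hand gives 1 - 2**(-32), which agrees with the "very nearly 1.0" of the infinite series.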


Sometimes, a function similar to ntz(x) is wanted, but a 0
argument is a special case, perhaps an error, that should be
identified with a value of the function that’s easily distinguished
from the “normal” values of the function. For example, let us
define “the number of factors of 2 in x” to be

$$\operatorname{nfact2}(x) = \begin{cases} \operatorname{ntz}(x), & x \ne 0, \\ -1, & x = 0. \end{cases}$$

This can be calculated from

$$31 - \operatorname{nlz}(x \mathbin{\&} -x).$$
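
As a sketch in C, with a simple reference nlz standing in for the machine instruction (the helper name nlz_ref is an assumption):

```c
/* Simple reference nlz: shift left until the sign bit is on. */
static int nlz_ref(unsigned x) {
   int n = 0;

   if (x == 0) return 32;
   while ((x & 0x80000000u) == 0) {n = n + 1;  x = x << 1;}
   return n;
}

/* Number of factors of 2 in x, or -1 if x = 0. */
int nfact2(unsigned x) {
   return 31 - nlz_ref(x & -x);   /* x & -x isolates the rightmost 1-bit. */
}
```

When x = 0 the isolated bit is also 0, nlz gives 32, and the result is the distinguished value -1.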


Applications 


[GLS1] points out some interesting applications of the number of 
trailing zeros function. It has been called the “ruler function" 
because it gives the height of a tick mark on a ruler that's 
divided into halves, quarters, eighths, and so on. 


It has an application in R. W. Gosper’s loop-detection 
algorithm, which will now be described in some detail, because 
it is quite elegant and it does more than might at first seem 
possible. 


Suppose a sequence $X_0, X_1, X_2, \ldots$ is defined by $X_{n+1} = f(X_n)$.
If the range of f is finite, the sequence is necessarily periodic.
That is, it consists of a leader $X_0, X_1, \ldots, X_{\mu-1}$ followed by a cycle
$X_\mu, X_{\mu+1}, \ldots, X_{\mu+\lambda-1}$ that repeats without limit ($X_\mu = X_{\mu+\lambda}$,
$X_{\mu+1} = X_{\mu+\lambda+1}$, and so on, where $\lambda$ is the period of the cycle).
Given the function f, the loop-detection problem is to find the
index $\mu$ of the first element that repeats, and the period $\lambda$. Loop
detection has applications in testing random number generators
and detecting a cycle in a linked list.

One could save all the values of the sequence as they are 
produced and compare each new element with all the preceding 
ones. This would immediately show where the second cycle 
starts. But algorithms exist that are much more efficient in space 
and time. 

Perhaps the simplest is due to R. W. Floyd [Knu2, sec. 3.1,
prob. 6]. This algorithm iterates the process

$$\begin{aligned} x &\leftarrow f(x) \\ y &\leftarrow f(f(y)) \end{aligned}$$

with x and y initialized to $X_0$. After the nth step, $x = X_n$ and
$y = X_{2n}$. These are compared, and if equal, it is known that $X_n$
and $X_{2n}$ are separated by an integral multiple of the period $\lambda$;
that is, 2n - n = n is a multiple of $\lambda$. Then $\mu$ can be determined
by regenerating the sequence from the beginning, comparing $X_0$
to $X_n$, then $X_1$ to $X_{n+1}$, and so on. Equality occurs when $X_\mu$ is
compared to $X_{n+\mu}$. Finally, $\lambda$ can be determined by regenerating
more elements, comparing $X_\mu$ to $X_{\mu+1}, X_{\mu+2}, \ldots$ This
algorithm requires only a small and bounded amount of space,
but it evaluates f many times.
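
The three phases of Floyd's method can be sketched in C as follows. This is an illustration, not code from the text: the names ld_floyd and f_demo are hypothetical, and f_demo is a toy function chosen so that the leader and cycle are easy to see (0, 1, ..., 10, 5, 6, ..., so the first repeating element has index 5 and the period is 6).

```c
/* Finds mu and lambda for X0, f(X0), f(f(X0)), ...
   Assumes the range of f is finite, so the loops terminate. */
void ld_floyd(int (*f)(int), int X0, int *mu, int *lambda) {
   int x, y, i;

   /* Phase 1: step x once and y twice until they are equal.
      Then x = X(n) = X(2n) for some n that is a multiple of lambda. */
   x = X0;  y = X0;
   do {
      x = f(x);
      y = f(f(y));
   } while (x != y);

   /* Phase 2: compare X(i) with X(n+i); the first match is at i = mu. */
   y = x;  x = X0;  i = 0;
   while (x != y) {
      x = f(x);  y = f(y);  i = i + 1;
   }
   *mu = i;

   /* Phase 3: advance from X(mu) until it recurs; the count is lambda. */
   i = 1;  y = f(x);
   while (x != y) {y = f(y);  i = i + 1;}
   *lambda = i;
}

/* Toy sequence for illustration: 0, 1, ..., 10, 5, 6, ..., 10, 5, ... */
int f_demo(int x) { return x < 10 ? x + 1 : 5; }
```

A call such as ld_floyd(f_demo, 0, &mu, &lambda) sets mu to 5 and lambda to 6. Note how often f is reevaluated: every phase regenerates part of the sequence.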


Gosper’s algorithm [HAK, item 132; Knu2, Answers to
Exercises for Section 3.1, exercise 7] finds the period $\lambda$, but not
the starting point $\mu$ of the first cycle. Its main feature is that it
never backs up to reevaluate f, and it is quite economical in
space and time. It is not bounded in space; it requires a table of
size $\log_2(\Lambda) + 1$, where $\Lambda$ is the largest possible period. This is
not a lot of space; for example, if it is known a priori that $\Lambda \le 2^{32}$,
then 33 words suffice.


Gosper's algorithm, coded in C, is shown in Figure 5-28. This
C function is given the function f being analyzed and a starting
value $X_0$. It returns lower and upper bounds on $\mu$, and the period
$\lambda$. (Although Gosper's algorithm cannot compute $\mu$, it can
compute lower and upper bounds $\mu_l$ and $\mu_u$ such that
$\mu_u - \mu_l + 1 \le \max(\lambda - 1, 1)$.) The algorithm works by comparing $X_n$, for
n = 1, 2, ..., to a subset of size $\lfloor \log_2 n \rfloor + 1$ of the elements of
the sequence that precede $X_n$. The elements of the subset are the
closest preceding $X_i$ such that i + 1 ends in a 1-bit (that is, i is
the even number preceding n), the closest preceding $X_i$ such that
i + 1 ends in exactly one 0-bit, the closest preceding $X_i$ such
that i + 1 ends in exactly two 0-bits, and so on.


void ld_Gosper(int (*f)(int), int X0, int *mu_l,
                              int *mu_u, int *lambda) {
   int Xn, k, m, kmax, n, lgl;
   int T[33];

   T[0] = X0;
   Xn = X0;
   for (n = 1; ; n++) {
      Xn = f(Xn);
      kmax = 31 - nlz(n);           // Floor(log2 n).
      for (k = 0; k <= kmax; k++) {
         if (Xn == T[k]) goto L;
      }
      T[ntz(n+1)] = Xn;             // No match.
   }
L:
   // Compute m = max{i | i < n and ntz(i+1) = k}.

   m = ((((n >> k) - 1) | 1) << k) - 1;
   *lambda = n - m;
   lgl = 31 - nlz(*lambda - 1);     // Ceil(log2 lambda) - 1.
   *mu_u = m;                       // Upper bound on mu.
   *mu_l = m - max(1, 1 << lgl) + 1;// Lower bound on mu.
}


FIGURE 5-28. Gosper's loop-detection algorithm. 


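Thus, the comparisons proceed as follows:

$$\begin{array}{lll}
X_1 : X_0 & X_7 : X_6, X_5, X_3 & X_{13} : X_{12}, X_9, X_{11}, X_7 \\
X_2 : X_0, X_1 & X_8 : X_6, X_5, X_3, X_7 & X_{14} : X_{12}, X_{13}, X_{11}, X_7 \\
X_3 : X_2, X_1 & X_9 : X_8, X_5, X_3, X_7 & X_{15} : X_{14}, X_{13}, X_{11}, X_7 \\
X_4 : X_2, X_1, X_3 & X_{10} : X_8, X_9, X_3, X_7 & X_{16} : X_{14}, X_{13}, X_{11}, X_7, X_{15} \\
X_5 : X_4, X_1, X_3 & X_{11} : X_{10}, X_9, X_3, X_7 & X_{17} : X_{16}, X_{13}, X_{11}, X_7, X_{15} \\
X_6 : X_4, X_5, X_3 & X_{12} : X_{10}, X_9, X_{11}, X_7 & X_{18} : X_{16}, X_{17}, X_{11}, X_7, X_{15}
\end{array}$$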

It can be shown that the algorithm always terminates with n
somewhere in the second cycle; that is, with $n < \mu + 2\lambda$. See
[Knu2] for further details.


The ruler function reveals how to solve the Tower of Hanoi
puzzle. Number the n disks from 0 to n - 1. At each move k, as
k goes from 1 to $2^n - 1$, move disk ntz(k) the minimum
permitted distance to the right, in a circular manner.


The ruler function can be used to generate a reflected binary
Gray code (see Section 13-1 on page 311). Start with an
arbitrary n-bit word, and at each step k, as k goes from 1 to
$2^n - 1$, flip bit ntz(k).
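
The Gray code construction is short enough to sketch in full. The names gray_by_ruler and ntz_ref are hypothetical, the helper is a plain counting loop rather than one of the faster methods above, and n is assumed to be less than 32.

```c
/* Simple reference ntz for x != 0: count shifts until bit 0 is on. */
static int ntz_ref(unsigned x) {
   int n = 0;

   while ((x & 1) == 0) {n = n + 1;  x = x >> 1;}
   return n;
}

/* Stores the 2**n words of the sequence in out[0 .. 2**n - 1],
   beginning with start and flipping bit ntz(k) at step k. */
void gray_by_ruler(unsigned start, int n, unsigned *out) {
   unsigned w = start;
   unsigned k;

   out[0] = w;
   for (k = 1; k <= (1u << n) - 1; k++) {
      w = w ^ (1u << ntz_ref(k));
      out[k] = w;
   }
}
```

With start = 0 and n = 3 the words produced are 000, 001, 011, 010, 110, 111, 101, 100, each differing from its predecessor in exactly one bit.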


Exercises 


1. Code Dubé’s algorithm for the ntz function, expanding the 
multiplication. 


2. Code the “right justify” function, x >> ntz(x) (unsigned shift), x ≠ 0, in
three basic RISC instructions.


3. Are the parallel prefix and suffix (with XOR) operations
invertible? If so, how would you compute the inverse
functions?


Chapter 6. Searching Words 


6-1 Find First 0-Byte 


The need for this function stems mainly from the way character 
strings are represented in the C language. They have no explicit 
length stored with them; instead, the end of the string is denoted 
by an all-0 byte. To find the length of a string, a C program uses 
the "strlen" (string length) function. This function searches the 
string, from left to right, for the 0-byte, and returns the number
of bytes scanned, not counting the 0-byte.


A fast implementation of “strlen” might load and test single
bytes until a word boundary is reached, and then load a word at
a time into a register, and test the register for the presence of a
0-byte. On big-endian machines, we want a function that returns
the index of the first 0-byte from the left. A convenient encoding
is values from 0 to 3 denoting bytes 0 to 3, and a value of 4
denoting that there is no 0-byte in the word. This is the value to
add to the string length, as successive words are searched, if the
string length is initialized to 0. On little-endian machines, one
wants the index of the first 0-byte from the right end of the
register, because little-endian machines reverse the four bytes
when a word is loaded into a register. Specifically, we are
interested in the following functions, where “00” denotes a
0-byte, “nn” denotes a nonzero byte, and “xx” denotes a byte that
may be 0 or nonzero.


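$$
\mathrm{zbytel}(x) = \begin{cases}
0, & x = \texttt{00xxxxxx}, \\
1, & x = \texttt{nn00xxxx}, \\
2, & x = \texttt{nnnn00xx}, \\
3, & x = \texttt{nnnnnn00}, \\
4, & x = \texttt{nnnnnnnn},
\end{cases}
\qquad
\mathrm{zbyter}(x) = \begin{cases}
0, & x = \texttt{xxxxxx00}, \\
1, & x = \texttt{xxxx00nn}, \\
2, & x = \texttt{xx00nnnn}, \\
3, & x = \texttt{00nnnnnn}, \\
4, & x = \texttt{nnnnnnnn}.
\end{cases}
$$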


Our first procedure for the find leftmost 0-byte function,
shown in Figure 6-1, simply tests each byte, in left-to-right
order, and returns the result when the first 0-byte is found.


int zbytel(unsigned x) {
   if      ((x >> 24) == 0)        return 0;
   else if ((x & 0x00FF0000) == 0) return 1;
   else if ((x & 0x0000FF00) == 0) return 2;
   else if ((x & 0x000000FF) == 0) return 3;
   else return 4;
}


FIGURE 6-1. Find leftmost 0-byte, simple sequence of tests.


This executes in two to 11 basic RISC instructions, 11 in the 
case that the word has no 0-bytes (which is the important case 
for the “strlen” function). A very similar program will handle the 
problem of finding the rightmost 0-byte. 


Figure 6-2 shows a branch-free procedure for this function.
The idea is to convert each 0-byte to 0x80, and each nonzero
byte to 0x00, and then use number of leading zeros. This
procedure executes in eight instructions, if the machine has the
number of leading zeros and nor instructions. Some similar tricks
are described in [Lamp].


int zbytel(unsigned x) {
   unsigned y;
   int n;
                                       // Original byte: 00 80 other
   y = (x & 0x7F7F7F7F) + 0x7F7F7F7F;  // 7F 7F 1xxxxxxx
   y = ~(y | x | 0x7F7F7F7F);          // 80 00 00000000
   n = nlz(y) >> 3;                    // n = 0 ... 4, 4 if x
   return n;                           // has no 0-byte.
}


FIGURE 6-2. Find leftmost 0-byte, branch-free code.


The position of the rightmost 0-byte is given by the number of
trailing 0’s in the final value of y computed above, divided by 8
(with fraction discarded). Using the expression for computing
the number of trailing 0’s by means of the number of leading zeros
instruction (see Section 5-4, “Counting Trailing 0’s,” on page
107), this can be computed by replacing the assignment to n in
the procedure above with:


   n = (32 - nlz(~y & (y - 1))) >> 3;


This is a 12-instruction solution, if the machine has nor and and 
not. 


In most situations on PowerPC, incidentally, a procedure to
find the rightmost 0-byte would not be needed. Instead, the
words can be loaded with the load word byte-reverse instruction
(lwbrx).


The procedure of Figure 6-2 is more valuable on a 64-bit 
machine than on a 32-bit one, because on a 64-bit machine the 
procedure (with obvious modifications) requires about the same 
number of instructions (seven or ten, depending upon how the 
constant is generated), whereas the technique of Figure 6-1 
requires 23 instructions worst case. 


If only a test for the presence of a 0-byte is wanted, then a
branch on zero (or nonzero) can be inserted just after the second
assignment to y.


A method similar to that of Figure 6-2, but for finding the
rightmost 0-byte in a word x (zbyter(x)), is [Mycro]:


   y = (x - 0x01010101) & ~x & 0x80808080;
   n = ntz(y) >> 3;


This executes in only five instructions, exclusive of loading the
constants, if the machine has the and not and number of trailing
zeros instructions. It cannot be used to compute zbytel(x),
because of a problem with borrows. It would be most useful for
finding the first 0-byte in a character string on a little-endian
machine, or to simply test for a 0-byte (using only the
assignment to y) on a machine of either endianness.
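The following sketch packages this method as two small functions. It assumes 32-bit words (uint32_t); the names has_zero_byte and ntz32 are hypothetical, and ntz32 is a simple loop standing in for a number of trailing zeros instruction.

```c
#include <stdint.h>

/* Stand-in for the number of trailing zeros instruction; 32 for y == 0. */
static int ntz32(uint32_t y) {
    int n = 0;
    if (y == 0) return 32;
    while ((y & 1) == 0) { y >>= 1; n++; }
    return n;
}

/* Nonzero if and only if some byte of x is 0. Because of borrows, bits
   to the left of the rightmost 0-byte may be set spuriously. */
uint32_t has_zero_byte(uint32_t x) {
    return (x - 0x01010101u) & ~x & 0x80808080u;
}

/* Index of the rightmost 0-byte, counted from the right; 4 if none. */
int zbyter(uint32_t x) {
    return ntz32(has_zero_byte(x)) >> 3;
}
```

The word 0x11110100 shows the borrow problem: has_zero_byte returns 0x00008080, flagging the byte 0x01 as well as the true 0-byte to its right. The rightmost flag is still correct, which is why the method serves for zbyter but not for zbytel.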


If the nlz instruction is not available, there does not seem to
be any really good way to compute the find first 0-byte function.
Figure 6-3 shows a possibility (only the executable part of the
code is shown).


This executes in ten to 13 basic RISC instructions, ten in the 
all-nonzero case. Thus, it is probably not as good as the code of 
Figure 6-1, although it does have fewer branch instructions. It 
does not scale very well to 64-bit machines, unfortunately. 


There are other possibilities for avoiding the nlz function. 


The value of y computed by the code of Figure 6-3 consists of
four bytes, each of which is either 0x00 or 0x80. The remainder
after dividing such a number by 0x7F is the original value with
the up-to-four 1-bits moved and compressed to the four
rightmost positions. Thus, the remainder ranges from 0 to 15
and uniquely identifies the original number. For example,


   remu(0x80808080, 127) = 15,
   remu(0x80000000, 127) = 8,
   remu(0x00008080, 127) = 3, etc.


This value can be used to index a table, 16 bytes in size, to get 
the desired result. Thus, the code beginning if (y == 0) can be 
replaced with 


   static char table[16] = {4, 3, 2, 2, 1, 1, 1, 1,
                            0, 0, 0, 0, 0, 0, 0, 0};
   return table[y%127];


where y is unsigned. The number 31 can be used in place of 
127, but with a different table. 


                                       // Original byte: 00 80 other
   y = (x & 0x7F7F7F7F) + 0x7F7F7F7F;  // 7F 7F 1xxxxxxx
   y = ~(y | x | 0x7F7F7F7F);          // 80 00 00000000
                                       // These steps map:
   if (y == 0) return 4;               // 00000000 ==> 4,
   else if (y > 0x0000FFFF)            // 80xxxxxx ==> 0,
      return (y >> 31) ^ 1;            // 0080xxxx ==> 1,
   else                                // 000080xx ==> 2,
      return (y >> 15) ^ 3;            // 00000080 ==> 3.


FIGURE 6-3. Find leftmost 0-byte, not using nlz.


These methods involving dividing by 127 or 31 are really just 
curiosities, because the remainder function is apt to require 20 
cycles or more, even if directly implemented in hardware. 
However, below are two more efficient replacements for the 
code in Figure 6-3 beginning with if (y == 0): 


   return table[hopu(y, 0x02040810) & 15];
   return table[y*0x00204081 >> 28];


Here, hopu(a, b) denotes the high-order 32 bits of the unsigned
product of a and b. In the second line, we assume the usual HLL
convention that the value of the multiplication is the low-order
32 bits of the complete product. This might be a practical
method, if either the machine has a fast multiply or the
multiplication by 0x204081 is done by shift-and-add’s. It can be
done in four such instructions, as suggested by


$$y(1 + 2^7 + 2^{14} + 2^{21}) = y(1 + 2^7)(1 + 2^{14}).$$


Using this 4-cycle way to do the multiplication, the total time for 
the procedure comes to 13 cycles (7 to compute y, plus 4 for the 
shift-and-add’s, plus 2 for the shift right of 28 and the table 
index), and of course it is branch-free. 
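As a concrete sketch of the multiplicative variant, the function below combines the computation of y with the table lookup. The name zbytel_mul is hypothetical, 32-bit words are assumed, and the table entries are derived from the mapping described above (the leftmost 0x80 byte of y lands in the high-order bit of the 4-bit index).

```c
#include <stdint.h>

/* Index of the leftmost 0-byte of x (0..3), or 4 if there is none,
   computed without a number of leading zeros instruction. */
int zbytel_mul(uint32_t x) {
    static const char table[16] = {4, 3, 2, 2, 1, 1, 1, 1,
                                   0, 0, 0, 0, 0, 0, 0, 0};
    uint32_t y = (x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu;
    y = ~(y | x | 0x7F7F7F7Fu);           // 0x80 where x has a 0-byte.
    // The low-order 32 bits of the product gather the four flag bits
    // into bit positions 31..28, with no carries between them.
    return table[(uint32_t)(y * 0x00204081u) >> 28];
}
```

The multiplier is $1 + 2^7 + 2^{14} + 2^{21}$, so a compiler or programmer could replace the multiply by the four shift-and-add steps mentioned above without changing the result.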


These scale reasonably well to a 64-bit machine. For the 
“modulus” method, use 

   return table[y%511];


where table is of size 256, with values 8, 0, 1, 0, 2, 0, 1, 0, 3, 0,
1, 0, 2, 0, 1, 0, 4, ... (i.e., table[i] = number of trailing 0’s in
i).

For the multiplicative methods, use either 


   return table[hopu(y, 0x0204081020408100) & 255]; or
   return table[(y*0x0002040810204081) >> 56];


where table is of size 256, with values 8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 
4, 4, 4, 4, 4, 4, 3, .... 


The multiplication by 0x2040810204081 can be done with

$$
\begin{aligned}
t_1 &\leftarrow y(1 + 2^7) \\
t_2 &\leftarrow t_1(1 + 2^{14}) \\
t_3 &\leftarrow t_2(1 + 2^{28})
\end{aligned}
$$


which gives a 13-cycle solution. 


All these variations using the table can, of course, implement
the find rightmost 0-byte function by simply changing the data in
the table.


If the machine does not have the nor instruction, the not in
the second assignment to y in Figure 6-3 can be omitted, in the
case of a 32-bit machine, by using one of the three return
statements given above, with table[i] = 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 2, 2, 3, 4. This scheme does not quite work on a 64-bit
machine.


Here is an interesting variation on the procedure of Figure
6-2, again aimed at machines that do not have number of leading
zeros. Let a, b, c, and d be 1-bit variables for the predicates “the
first byte of x is nonzero,” “the second byte of x is nonzero,” and
so on. Then,


zbytel(x) = a + ab + abc + abcd. 


The multiplications can be done with and’s, leading to the 
procedure shown in Figure 6-4 (only the executable code is 
shown). This comes to 15 instructions on the basic RISC, which 
is not particularly fast, but there is a certain amount of 
parallelism. On a superscalar machine that can execute up to 
three arithmetic instructions in parallel, provided they are 
independent, it comes to only ten cycles. 


   y = (x & 0x7F7F7F7F) + 0x7F7F7F7F;
   y = y | x;               // Leading 1 on nonzero bytes.
   t1 = y >> 31;            // t1 = a.
   t2 = (y >> 23) & t1;     // t2 = ab.
   t3 = (y >> 15) & t2;     // t3 = abc.
   t4 = (y >>  7) & t3;     // t4 = abcd.
   return t1 + t2 + t3 + t4;


FIGURE 6-4. Find leftmost 0-byte by evaluating a polynomial. 


A simple variation of this does the find rightmost 0-byte
function, based on


zbyter(x) = abcd + bcd + cd + d.


(This requires one more and than the code of Figure 6-4.)


Some Simple Generalizations 


Functions zbytel and zbyter can be used to search for a byte
equal to any particular value, by first exclusive or’ing the
argument x with a word consisting of the desired value
replicated in each byte position. For example, to search x for an
ASCII blank (0x20), search $x \oplus \text{0x20202020}$ for a 0-byte.


Similarly, to search for a byte position in which two words x
and y are equal, search $x \oplus y$ for a 0-byte.


There is nothing special about byte boundaries in the code of
Figure 6-2 and its variants. For example, to search a word for a
0-value in any of the first four bits, the next 12, or the last 16,
use the code of Figure 6-2 with the mask replaced by
0x77FF7FFF [PHO]. (If a field length is 1, use a 0 in the mask at
that position.)
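The first two generalizations might be sketched as follows. The wrappers find_byte and find_equal_byte are hypothetical names, 32-bit words are assumed, and nlz32 is a simple loop standing in for the number of leading zeros instruction.

```c
#include <stdint.h>

/* Stand-in for the number of leading zeros instruction; 32 for x == 0. */
static int nlz32(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* The procedure of Figure 6-2: leftmost 0-byte index, or 4 if none. */
static int zbytel(uint32_t x) {
    uint32_t y = (x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu;
    y = ~(y | x | 0x7F7F7F7Fu);
    return nlz32(y) >> 3;
}

/* Index of the leftmost byte of x equal to c, or 4 if there is none. */
int find_byte(uint32_t x, unsigned char c) {
    return zbytel(x ^ (0x01010101u * c));   // Replicate c in every byte.
}

/* Index of the leftmost byte position in which x and y agree, or 4. */
int find_equal_byte(uint32_t x, uint32_t y) {
    return zbytel(x ^ y);
}
```

For instance, find_byte(0x41422043, 0x20) locates the blank in the four characters "AB C" at index 2.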


Searching for a Value in a Given Range 


The code of Figure 6-2 can easily be modified to search for a
byte in the range 0 to any specified value less than 128. To
illustrate, the following code finds the index of the leftmost byte
having value from 0 to 9:


   y = (x & 0x7F7F7F7F) + 0x76767676;
   y = y | x;
   y = y | 0x7F7F7F7F;         // Bytes > 9 are 0xFF.
   y = ~y;                     // Bytes > 9 are 0x00,
                               // bytes <= 9 are 0x80.
   n = nlz(y) >> 3;


More generally, suppose you want to find the leftmost byte in
a word that is in the range a to b, where the difference between
a and b is less than 128. For example, the uppercase letters
encoded in ASCII range from 0x41 to 0x5A. To find the first
uppercase letter in a word, subtract 0x41414141 in such a way
that the borrow does not propagate across byte boundaries, and
then use the above code to identify bytes having value from 0 to
0x19 (0x5A - 0x41). Using the formulas for subtraction given in
Section 2-18, “Multibyte Add, Subtract, Absolute Value,” on
page 40, with obvious simplifications possible with y =
0x41414141, gives


   d = (x | 0x80808080) - 0x41414141;
   d = ~((x | 0x7F7F7F7F) ^ d);
   y = (d & 0x7F7F7F7F) + 0x66666666;
   y = y | d;
   y = y | 0x7F7F7F7F;      // Bytes not from 41-5A are FF.
   y = ~y;                  // Bytes not from 41-5A are 00,
                            // bytes from 41-5A are 80.
   n = nlz(y) >> 3;


For some ranges of values, simpler code exists. For example,
to find the first byte whose value is 0x30 to 0x39 (a decimal
digit encoded in ASCII), simply exclusive or the input word with
0x30303030 and then use the code given above to search for a
value in the range 0 to 9. (This simplification is applicable when
the upper and lower limits have n high-order bits in common,
and the lower limit ends with 8 - n 0’s.)


These techniques can be adapted to handle ranges of 128 or
larger with no additional instructions. For example, to find the
index of the leftmost byte whose value is in the range 0 to 137
(0x89), simply change the line y = y | x to y = y & x in the
code above for searching for a value from 0 to 9.


Similarly, changing the line y = y | d to y = y & d in the
code for finding the leftmost byte whose value is in the range
0x41 to 0x5A causes it to find the leftmost byte whose value is
in the range 0x41 to 0xDA.
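The range searches of this section can be collected into a small sketch for experimentation. The function names find_le9, find_digit, and find_upper are hypothetical; 32-bit words are assumed, and nlz32 is a loop standing in for the number of leading zeros instruction.

```c
#include <stdint.h>

/* Stand-in for the number of leading zeros instruction; 32 for x == 0. */
static int nlz32(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Index of the leftmost byte with value 0 to 9, or 4 if none. */
int find_le9(uint32_t x) {
    uint32_t y = (x & 0x7F7F7F7Fu) + 0x76767676u;
    y = y | x;
    y = y | 0x7F7F7F7Fu;            // Bytes > 9 are 0xFF.
    return nlz32(~y) >> 3;          // Bytes <= 9 became 0x80.
}

/* Index of the leftmost ASCII decimal digit (0x30 to 0x39), or 4. */
int find_digit(uint32_t x) {
    return find_le9(x ^ 0x30303030u);
}

/* Index of the leftmost ASCII uppercase letter (0x41 to 0x5A), or 4. */
int find_upper(uint32_t x) {
    uint32_t d, y;
    d = (x | 0x80808080u) - 0x41414141u;   // Bytewise x - 0x41, with
    d = ~((x | 0x7F7F7F7Fu) ^ d);          // no borrow across bytes.
    y = (d & 0x7F7F7F7Fu) + 0x66666666u;
    y = y | d;
    y = y | 0x7F7F7F7Fu;            // Bytes not from 41-5A are FF.
    return nlz32(~y) >> 3;
}
```

As an example, the word 0x61425A7B holds the characters "aBZ{", and find_upper returns 1 for it, the index of the letter B.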


6-2 Find First String of 1-Bits of a Given Length 


The problem here is to search a word in a register for the first 
string of 1-bits of a given length n or longer, and to return its 
position, with some special indication if no such string exists. 
Variants are to return only the yes/no indication and to locate 
the first string of exactly n 1-bits. This problem has application 
in disk-allocation programs, particularly for disk compaction 
(rearranging data on a disk so that all blocks used to store a file 
are contiguous). The problem was suggested to me by Albert 
Chang, who pointed out that it is one of the uses for the number 
of leading zeros instruction. 


We assume here that the number of leading zeros instruction, 
or a suitable subroutine for that function, is available. 


An algorithm that immediately comes to mind is to first count
the number of leading 0’s and skip over them by shifting left by
the number obtained. Then count the leading 1’s by inverting
and counting leading 0’s. If this is of sufficient length, we are
done. Otherwise, shift left by the number obtained and repeat
from the beginning. This algorithm might be coded as shown
below. If n consecutive 1-bits are found, it returns a number
from 0 to 31, giving the position of the leftmost 1-bit in the
leftmost such sequence. Otherwise, it returns 32 as a “not found”
indication.


int ffstr1(unsigned x, int n) {
   int k, p;

   p = 0;               // Initialize position to return.
   while (x != 0) {
      k = nlz(x);       // Skip over initial 0's
      x = x << k;       // (if any).
      p = p + k;
      k = nlz(~x);      // Count first/next group of 1's.
      if (k >= n)       // If enough,
         return p;      // return.
      x = x << k;       // Not enough 1's, skip over
      p = p + k;        // them.
   }
   return 32;
}


This algorithm is reasonable if it is expected that the loop 
will not be executed very many times—for example, if it is 
expected that x will have long sequences of 1’s and of 0’s. This 
might very well be the expectation in the disk-allocation 
application. Its worst-case execution time, however, is not very 
good; for example, about 178 full RISC instructions executed for 
x = 0x55555555 and n = 2. 


An algorithm that is better in worst-case execution time is
based on a sequence of shift left and and instructions. To see how
this works, consider searching for a string of eight or more
consecutive 1-bits in a 32-bit word x. This might be done as
follows:

$$
\begin{aligned}
x &\leftarrow x \mathbin{\&} (x \ll 1) \\
x &\leftarrow x \mathbin{\&} (x \ll 2) \\
x &\leftarrow x \mathbin{\&} (x \ll 4)
\end{aligned}
$$


After the first assignment, the 1’s in x indicate the starting
positions of strings of length 2. After the second assignment, the
1’s in x indicate the starting positions of strings of length 4 (a
string of length 2 followed by another string of length 2). After
the third assignment, the 1’s in x indicate the starting positions
of strings of length 8. Executing number of leading zeros on this
word gives the position of the first string of length 8 (or more),
or 32 if none exists.


To develop an algorithm that works for any length n from 1
to 32, we will look at this a little differently. First, observe that
the above three assignments can be done in any order. Reverse
order will be more convenient. To illustrate the general method,
consider the case n = 10:

$$
\begin{aligned}
x_1 &\leftarrow x \mathbin{\&} (x \ll 5) \\
x_2 &\leftarrow x_1 \mathbin{\&} (x_1 \ll 2) \\
x_3 &\leftarrow x_2 \mathbin{\&} (x_2 \ll 1) \\
x_4 &\leftarrow x_3 \mathbin{\&} (x_3 \ll 1)
\end{aligned}
$$


The first statement shifts by n/2. After executing it, the
problem is reduced to finding a string of five consecutive 1-bits
in $x_1$. This can be done by shifting left by $\lfloor 5/2 \rfloor = 2$, and’ing,
and searching the result for a string of length 3 (5 - 2). The last
two statements identify where the strings of length 3 are in $x_2$.
The sum of the shift amounts is always n - 1. The algorithm is
shown in Figure 6-5. The execution time ranges from 3 to 36
full RISC instructions, as n ranges from 1 to 32.


int ffstr1(unsigned x, int n) {
   int s;

   while (n > 1) {
      s = n >> 1;
      x = x & (x << s);
      n = n - s;
   }
   return nlz(x);
}


FIGURE 6-5. Find first string of n 1’s, shift-and-and sequence. 


If n is often moderately large, it is not unreasonable to unroll
this loop by repeating the loop body five times and omitting the
test n > 1. (Five is always sufficient for a 32-bit machine.) This
gives a branch-free algorithm that runs in a constant time of 20
instructions executed (the last assignment to n can be omitted).
Although for small values of n, the three assignments are
executed more than necessary, the result is unchanged by the
extra steps, because variable n sticks at the value 1, and for this
value the three steps have no effect on x or n. The unrolled
version is faster than the looping version for n ≥ 5, in terms of
number of instructions executed.


A string of exactly n 1-bits can be found in six more 
instructions (four if and not is available). The quantity x 
computed by the algorithm of Figure 6-5 has 1-bits wherever a 
string of length n or more 1-bits begins. Hence, using the final 
value of x computed by that algorithm, the expression 


$$x \mathbin{\&} \lnot(x \overset{u}{\gg} 1) \mathbin{\&} \lnot(x \ll 1)$$


contains a 1-bit wherever the final x contains an isolated 1-bit, 
which is to say wherever the original x began a string of exactly 
n 1-bits. 
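To make the two searches concrete, here is a sketch that separates the shift-and-and reduction from the final tests. The names starts_of_runs and ffstr1_exact are hypothetical, 32-bit words are assumed, and nlz32 is a loop standing in for the number of leading zeros instruction.

```c
#include <stdint.h>

/* Stand-in for the number of leading zeros instruction; 32 for x == 0. */
static int nlz32(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Returns a word with a 1-bit wherever a string of n or more
   consecutive 1-bits of x begins (at its left end), 1 <= n <= 32. */
static uint32_t starts_of_runs(uint32_t x, int n) {
    while (n > 1) {
        int s = n >> 1;
        x = x & (x << s);
        n = n - s;
    }
    return x;
}

/* Position (from the left) of the first string of n or more 1-bits. */
int ffstr1(uint32_t x, int n) {
    return nlz32(starts_of_runs(x, n));
}

/* Position (from the left) of the first string of exactly n 1-bits. */
int ffstr1_exact(uint32_t x, int n) {
    x = starts_of_runs(x, n);
    x = x & ~(x >> 1) & ~(x << 1);     // Keep only isolated 1-bits.
    return nlz32(x);
}
```

For x = 0x3FF3F3F8, which contains strings of lengths 10, 6, and 7 at positions 2, 14, and 22, ffstr1(x, 8) returns 2 and ffstr1_exact(x, 6) returns 14. Both functions return 32 when no suitable string exists.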


The algorithm is also easily adapted to finding strings of 
length n that begin at certain locations. For example, to find 
strings that begin at byte boundaries, simply and the final x with 
0x80808080. 


It can be used to find strings of 0-bits either by
complementing x at the start, or by changing the and’s to or’s
and complementing x just before invoking nlz. For example,
below is an algorithm for finding the first (leftmost) 0-byte (see
Section 6-1, “Find First 0-Byte,” on page 117, for a precise
definition of this problem).

$$
\begin{aligned}
x &\leftarrow x \mid (x \ll 4) \\
x &\leftarrow x \mid (x \ll 2) \\
x &\leftarrow x \mid (x \ll 1) \\
x &\leftarrow x \mid \text{0x7F7F7F7F} \\
p &\leftarrow \mathrm{nlz}(\lnot x) \overset{u}{\gg} 3
\end{aligned}
$$


This executes in 12 instructions on the full RISC (not as good as 
the algorithm of Figure 6-2 on page 118, which executes in 
eight instructions). 


6-3 Find Longest String of 1-Bits 


The nicely concise function shown in Figure 6-6 returns the 
length of the longest string of 1-bits in x [Hsieh]. 


int maxstr1(unsigned x) {
   int k;
   for (k = 0; x != 0; k++) x = x & 2*x;
   return k;
}


FIGURE 6-6. Find length of longest string of 1’s. 


It executes in 4n + 3 instructions on the basic RISC, where n 
is the length of the longest string of 1’s, or 131 instructions in 
the worst case. 
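A version of this function that fixes the word size at 32 bits is sketched below so that its behavior can be checked on a few inputs. The only change from Figure 6-6 is the use of uint32_t, which is an assumption made here to guarantee that 2*x wraps at 32 bits.

```c
#include <stdint.h>

/* Length of the longest string of 1-bits in x. Each iteration
   shortens every string of 1's by one bit, so the number of
   iterations needed to reach 0 is the length of the longest one. */
int maxstr1(uint32_t x) {
    int k;
    for (k = 0; x != 0; k++) x = x & 2*x;
    return k;
}
```

The extremes are maxstr1(0) = 0 and maxstr1(0xFFFFFFFF) = 32; the latter takes 32 iterations, which is the worst case cited above.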


To reduce the worst-case execution time, a “logarithmic”
version is possible. It works by propagating 0’s one, two, four,
eight, and 16 positions to the left, stopping at the last nonzero
word, and then backtracking to find the length of the longest
contiguous string of 1’s.


For example, suppose


   x   = 0011 1111 1111 0011 1111 0011 1111 1000

Then

   x2  = 0011 1111 1110 0011 1110 0011 1111 0000
   x4  = 0011 1111 1000 0011 1000 0011 1100 0000
   x8  = 0011 1000 0000 0000 0000 0000 0000 0000
   x16 = all 0's


In this case, the last nonzero word is x8. Observe that each
1-bit in x8 indicates the leftmost position of a string of eight 1’s.
Thus, the longest string of 1’s begins at the leftmost position of a
1-bit in x8, bit position 29 in the example. To test for a string of
length 12, one can test the bit at position 21 (29 - 8) in x4.
Since that is 0, there is no string of length 12. To test for a string
of length 10, one can test the bit at position 21 in x2. Since that
is 1, position 29 is the start of a string of length 10 (or more).
Last, to test for a string of length 11, one can test the bit at
position 19 (21 - 2) in x. Because that is 0, the longest string is
of length 10, and it starts at position 29.


This scheme is coded in Figure 6-7, except the code uses only
two variables, x and y, instead of the five variables x, x2, x4,
x8, and x16. This code finds both the length and position of the
longest string of 1’s, with the position being measured from the
left end of the string. The scheme does not work if x is 0 or all
1’s. These are special-cased, with the latter possibility being
handled in a place that is not executed frequently.


int fmaxstr1(unsigned x, int *apos) {
   unsigned y;
   int s;

   if (x == 0) {*apos = 32; return 0;}
   y = x & (x << 1);
   if (y == 0) {s = 1; goto L1;}
   x = y & (y << 2);
   if (x == 0) {s = 2; x = y; goto L2;}
   y = x & (x << 4);
   if (y == 0) {s = 4; goto L4;}
   x = y & (y << 8);
   if (x == 0) {s = 8; x = y; goto L8;}
   if (x == 0xFFFF8000) {*apos = 0; return 32;}
   s = 16;

L16: y = x & (x << 8);
     if (y != 0) {s = s + 8; x = y;}
L8:  y = x & (x << 4);
     if (y != 0) {s = s + 4; x = y;}
L4:  y = x & (x << 2);
     if (y != 0) {s = s + 2; x = y;}
L2:  y = x & (x << 1);
     if (y != 0) {s = s + 1; x = y;}
L1:  *apos = nlz(x);
   return s;
}


FIGURE 6-7. Find length and position of longest string of 1’s. 


The worst-case execution time on the basic RISC is 39
instructions, plus those required for the nlz function. If only the
length of the longest string of 1’s is wanted, there is no
significant savings in execution time, except for omitting the use
of the nlz function.


6-4 Find Shortest String of 1-Bits 


It is more difficult to find the shortest string of 1-bits in a word. 
One way to do it is to mark the beginnings of all strings of 1’s in 
a word b and the ends of all such strings in a word e. Then, if b 
& e is nonzero, the shortest string is of length 1. Otherwise, shift 
e left one position and test again. For example, if 


   x = 0011 1111 1111 0011 1111 0011 1111 1000

then

   b = 0010 0000 0000 0010 0000 0010 0000 0000
   e = 0000 0000 0001 0000 0001 0000 0000 1000


After shifting e left five places, b & e is nonzero. This means 
that the shortest string of 1-bits is of length 6. 


This idea is embodied in the code shown in Figure 6-8. As in
the preceding material, the position of the string is measured
from the left, and if there are two or more minimal length
strings of equal length, this function finds the leftmost one. For
example, if x = 0x00FF0FF0 it returns length 8, position 8.


int fminstr1(unsigned x, int *apos) {
   int k;
   unsigned b, e;         // Beginnings, ends.

   if (x == 0) {*apos = 32; return 0;}
   b = ~(x >> 1) & x;     // 0-1 transitions.
   e = x & ~(x << 1);     // 1-0 transitions.
   for (k = 1; (b & e) == 0; k++)
      e = e << 1;
   *apos = nlz(b & e);
   return k;
}


FIGURE 6-8. Find length and position of shortest string of
1’s.


The function executes in 8 + 4n instructions on the basic
RISC (without andc), plus the time for the nlz function, for n ≥
2, where n is the length of the shortest contiguous string of 1’s in
x.


Perhaps the ultimate problem in this class is to find the
length and position of the shortest string of 1’s in x that is at
least as long as a given integer n > 0. In terms of the storage
allocation problem, this is a “best fit” algorithm. This can be
done by first left-propagating the 0’s in x by n - 1 positions and
then finding the shortest string of 1’s in the revised x. See the
exercises.
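One way the two steps might be combined is sketched below. This is an illustration of the idea only, not a worked solution from the text: the name best_fit_str1 is hypothetical, 32-bit words are assumed, nlz32 is a loop standing in for the number of leading zeros instruction, and the propagation step borrows the shift-and-and loop of Figure 6-5.

```c
#include <stdint.h>

/* Stand-in for the number of leading zeros instruction; 32 for x == 0. */
static int nlz32(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Length of the shortest string of 1-bits in x that has length >= n
   (1 <= n <= 32), with its position from the left stored in *apos.
   Returns 0 with *apos = 32 if there is no such string. */
int best_fit_str1(uint32_t x, int n, int *apos) {
    uint32_t b, e;
    int k, m;

    // Left-propagate the 0's by n - 1 positions. A string of length
    // L >= n shrinks to length L - n + 1 and keeps its left end;
    // shorter strings disappear.
    for (m = n; m > 1; m = m - (m >> 1))
        x = x & (x << (m >> 1));

    if (x == 0) {*apos = 32; return 0;}
    b = ~(x >> 1) & x;               // Beginnings of strings.
    e = x & ~(x << 1);               // Ends of strings.
    for (k = 1; (b & e) == 0; k++)   // As in Figure 6-8.
        e = e << 1;
    *apos = nlz32(b & e);
    return k + n - 1;                // Undo the shortening.
}
```

With x = 0x3FF3F3F8 (strings of lengths 10, 6, and 7) and n = 7, the function returns length 7 at position 22; with n = 8 it returns length 10 at position 2.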


Exercises 


1. Code an elaboration of Hsieh’s algorithm that will find 
both the length and position of the longest string of 1’s in 
a word x. You may use the nlz function. 


2. Code a function for finding the length and position of the 
shortest string of 1’s in a word x that is at least as long as 
a given integer n. 


3. Another way to find the shortest string of 1’s in a word x
is to successively turn off the rightmost string of 1’s in x
and observe the change in population count at each step.
Code a function for the full RISC that uses this idea and
also finds the position of a shortest string of 1’s.


4. For “completely random” 32-bit words x (each bit
independently 0 or 1 with probability 0.5), what is the
average number of strings of 1’s in x? The answer
determines the average execution time of the function of
exercise 3, for such input data.


5. Again, for “completely random” 32-bit words x, what is
the average length of the shortest contiguous string of 1’s
in x? The answer determines the average execution time
of function fminstr1 in Figure 6-8 for such input data.
Compute this with a Monte Carlo or exhaustive
enumeration program.


6. Of the $2^n$ binary words of length n, for how many is their
shortest contained string of 1’s of length 1? That is, how
many n-bit words begin with 10, or end with 01, or
contain the sequence 010? Find a closed-form solution or
a recursion, not an exhaustive enumeration program.


7. Similarly, of the $2^n$ binary words of length n, for how
many is their shortest contained string of 1’s of length 2?

Chapter 7. Rearranging Bits and Bytes 


7-1 Reversing Bits and Bytes 


By "reversing bits" we mean to reflect the contents of a register 
about the middle so that, for example, 


rev(0x01234567) = 0xE6A2C480.


By “reversing bytes” we mean a similar reflection of the four 
bytes of a register. Byte reversal is a necessary operation to 
convert data between the “little-endian” format used by DEC 
and Intel, and the “big-endian” format used by most other 
manufacturers. 


Bit reversal can be done quite efficiently by interchanging 
adjacent single bits, then interchanging adjacent 2-bit fields, and 
so on, as shown below [Aus1]. These five assignment statements 
can be executed in any order. This is the same algorithm as the 
first population count algorithm of Section 5-1, but with addition 
replaced with swapping. 


   x = (x & 0x55555555) <<  1 | (x & 0xAAAAAAAA) >>  1;
   x = (x & 0x33333333) <<  2 | (x & 0xCCCCCCCC) >>  2;
   x = (x & 0x0F0F0F0F) <<  4 | (x & 0xF0F0F0F0) >>  4;
   x = (x & 0x00FF00FF) <<  8 | (x & 0xFF00FF00) >>  8;
   x = (x & 0x0000FFFF) << 16 | (x & 0xFFFF0000) >> 16;


A small improvement may result on some machines by using 
fewer distinct large constants and doing the last two assignments 
in a more straightforward way, as shown in Figure 7-1 (30 basic 
RISC instructions, branch-free). 


unsigned rev(unsigned x) {
   x = (x & 0x55555555) <<  1 | (x >>  1) & 0x55555555;
   x = (x & 0x33333333) <<  2 | (x >>  2) & 0x33333333;
   x = (x & 0x0F0F0F0F) <<  4 | (x >>  4) & 0x0F0F0F0F;
   x = (x << 24) | ((x & 0xFF00) << 8) |
       ((x >> 8) & 0xFF00) | (x >> 24);
   return x;
}


FIGURE 7-1. Reversing bits. 
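For experimentation, the function of Figure 7-1 can be compiled as below. The only changes are assumptions made for portability: the word type is uint32_t, the constants carry a u suffix, and extra parentheses make the operator precedence explicit.

```c
#include <stdint.h>

/* Reverses the 32 bits of x: swaps adjacent bits, then adjacent
   2-bit fields, then nibbles, and finally reverses the four bytes. */
uint32_t rev(uint32_t x) {
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
    x = (x << 24) | ((x & 0xFF00u) << 8) |
        ((x >> 8) & 0xFF00u) | (x >> 24);
    return x;
}
```

Applied to the example at the start of this section, rev(0x01234567) gives 0xE6A2C480, and applying rev twice returns the original word.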


The last assignment to x in this code does byte reversal in 
nine basic RISC instructions. If the machine has rotate shifts, 
however, this can be done in seven instructions with 


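$$x \leftarrow ((x \mathbin{\&} \text{0x00FF00FF}) \overset{\text{rot}}{\gg} 8) \mid ((x \overset{\text{rot}}{\ll} 8) \mathbin{\&} \text{0x00FF00FF}).$$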


PowerPC can do the byte-reversal operation in only three 
instructions [Hay1]: a rotate left of 8, which positions two of the 
bytes, followed by two “rlwimi” (rotate left word immediate then 
mask insert) instructions. 


The next algorithm, by Christopher Strachey [Strach 1961], is 
old by computer standards, but it is instructive. It reverses the 
rightmost 16 bits of a word, assuming the leftmost 16 bits are 
clear at the start, and places the reversed halfword in the left 
half of the register. 


Its operation is based on the number of bit positions that each 
bit must move. The 16 bits, taken from left to right, must move 
1, 3, 5, ..., 31 positions. The bits that must move 16 or more 
positions are moved first, then those that must move eight or 
more positions, and so forth. The operation is illustrated below, 
where each letter denotes a single bit, and a period denotes a
“don’t care” bit.


   0000 0000 0000 0000 abcd efgh ijkl mnop   Given
   0000 0000 ijkl mnop abcd efgh .... ....   After shl 16
   0000 mnop ijkl efgh abcd .... .... ....   After shl 8
   00op mnkl ijgh efcd ab.. .... .... ....   After shl 4
   0pon mlkj ihgf edcb a... .... .... ....   After shl 2
   ponm lkji hgfe dcba .... .... .... ....   After shl 1


Straightforward code consists of 16 basic RISC instructions, 
plus 12 to load the constants: 


   x = x | ((x & 0x000000FF) << 16);
   x = (x & 0xF0F0F0F0) | ((x & 0x0F0F0F0F) << 8);
   x = (x & 0xCCCCCCCC) | ((x & 0x33333333) << 4);
   x = (x & 0xAAAAAAAA) | ((x & 0x55555555) << 2);
   x = x << 1;


Complementation can be used to reduce the number of 
distinct masks. By using more irregular masks, the rightmost 16 
bits can be preserved. 


If rotate shifts are available, Strachey's idea can be used to 
reverse a 32-bit word. The idea is to consider how many bit 
positions each bit must move rotationally to the left to get to its 
final position. Taking the bits from left to right, the shift 
amounts are 1, 3, 5, ..., 31, 1, 3, 5, ..., 31 (no bit moves an even 
number of positions). The algorithm first rotate-moves those bits 
that must move 16 or more positions, then those that must move 
eight or more positions, and so forth, and finally those that must 
move one position (which is all of the bits, because all move 
amounts are odd). This scheme is shown below, for reversing a 
32-bit word x. Function shlr(x, y) rotates x left y positions.


   x = shlr(x & 0x00FF00FF, 16) | x & ~0x00FF00FF;
   x = shlr(x & 0x0F0F0F0F,  8) | x & ~0x0F0F0F0F;
   x = shlr(x & 0x33333333,  4) | x & ~0x33333333;
   x = shlr(x & 0x55555555,  2) | x & ~0x55555555;
   x = shlr(x, 1);


The code uses and with complement to avoid loading some 
masks. If your machine does not have that instruction, it can be 
avoided by rewriting the first line of code as 



   x = shlr(x, 16) & 0x00FF00FF | x & ~0x00FF00FF;

which is a MUX operation, and using the identity

$$x \mathbin{\&} m \mid y \mathbin{\&} \lnot m = ((x \oplus y) \mathbin{\&} m) \oplus y$$

to obtain

   x = ((shlr(x, 16) ^ x) & 0x00FF00FF) ^ x;


and similarly for the other lines that have and with complement.
A slightly better way for many machines, in that it has a little
instruction-level parallelism, is to use the identity [Karv]

$$x \mathbin{\&} \lnot m = (x \mathbin{\&} m) \oplus x$$

and common the and expression. This gives the function shown
in Figure 7-2 (17 instructions, plus eight to load constants, or 25
in all).


unsigned rev(unsigned x) {
   unsigned t;

   t = x & 0x00FF00FF; x = shlr(t, 16) | t ^ x;
   t = x & 0x0F0F0F0F; x = shlr(t, 8) | t ^ x;
   t = x & 0x33333333; x = shlr(t, 4) | t ^ x;
   t = x & 0x55555555; x = shlr(t, 2) | t ^ x;
   x = shlr(x, 1);
   return x;
}


FIGURE 7-2. Reversing bits with rotate shifts. 
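
For readers who want to try Figure 7-2 on a machine without a rotate operator in C, the following is a minimal sketch. It assumes 32-bit arithmetic through `uint32_t`; the two-shift body of `shlr` and the name `rev_rotate` are illustrative choices for this sketch, not part of the original listing.

```c
#include <stdint.h>

/* Rotate x left by n positions, for 0 < n < 32. Illustrative stand-in
   for the shlr function used in the text. */
uint32_t shlr(uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

/* Reverse a 32-bit word with rotate shifts, following Figure 7-2. */
uint32_t rev_rotate(uint32_t x) {
    uint32_t t;

    t = x & 0x00FF00FFu; x = shlr(t, 16) | (t ^ x);
    t = x & 0x0F0F0F0Fu; x = shlr(t, 8)  | (t ^ x);
    t = x & 0x33333333u; x = shlr(t, 4)  | (t ^ x);
    t = x & 0x55555555u; x = shlr(t, 2)  | (t ^ x);
    return shlr(x, 1);   /* every bit still has one odd position to go */
}
```

A compiler that recognizes the two-shift rotate idiom may emit a single rotate instruction for each `shlr` call, which is the case the instruction counts in the text assume.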


It is perhaps worth noting that the constants 0x00FF00FF,
0x0F0F0F0F, and so on can be generated one from another as
shown below. This is not useful for 32-bit machines (it may even 
be harmful by reducing parallelism), because 32-bit RISC 
machines generally can load the constants in two instructions. 
But it might be useful for a 64-bit machine, for which it is 
illustrated. 


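$$
\begin{aligned}
C_0 &\leftarrow \texttt{0x00000000FFFFFFFF} \\
C_1 &\leftarrow C_0 \oplus (C_0 \ll 16) \\
C_2 &\leftarrow C_1 \oplus (C_1 \ll 8)
\end{aligned}
$$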
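
As a small illustration of this recurrence, the sketch below derives each 64-bit mask from the previous one. The function name `next_mask` is hypothetical, and carrying the same step on through shifts of 4, 2, and 1 is an assumption based on the pattern of the first two steps.

```c
#include <stdint.h>

/* One step of the recurrence: c ^ (c << s).
   Starting from 0x00000000FFFFFFFF, use s = 16, 8, 4, 2, 1 in turn. */
uint64_t next_mask(uint64_t c, int s) {
    return c ^ (c << s);
}
```

Starting from 0x00000000FFFFFFFF, successive calls yield 0x0000FFFF0000FFFF, 0x00FF00FF00FF00FF, 0x0F0F0F0F0F0F0F0F, and so on.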


Another way to reverse bits is to break the word up into three 
groups of bits, and swap the leftmost and rightmost groups, 
leaving the center group in place [Baum]. For a 27-bit word, this 
works as illustrated below. 


012345678 9abcdefgh ijklmnopq   The given 27-bit word
ijklmnopq 9abcdefgh 012345678   First ternary swap
opqlmnijk fghcde9ab 678345012   Second ternary swap
qponmlkji hgfedcba9 876543210   Third ternary swap


Straightforward code for this follows. If run on a 32-bit 
machine, it reverses bits 0 to 26, placing the result in bit
positions 0 to 26, and clearing bits 27 to 31. 


x = (x & 0x000001FF) << 18 | (x & 0x0003FE00) |
    (x >> 18) & 0x000001FF;
x = (x & 0x001C0E07) << 6 | (x & 0x00E07038) |
    (x >> 6) & 0x001C0E07;
x = (x & 0x01249249) << 2 | (x & 0x02492492) |
    (x >> 2) & 0x01249249;

This amounts to 21 basic RISC instructions, plus 10 to load the 
constants, or 31 in all. In comparison, the code of Figure 7-1 is 
24 basic RISC instructions, plus six to load constants, plus a shift 
right of 5 to right-justify the result, or 31 in all. Thus, the 
ternary method is equal or superior when there are 27 or fewer 
bits to be reversed. 
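
The ternary method can be packaged as a function for experimentation. This is a sketch only: the name `rev27` and the use of `uint32_t` are assumptions, and the extra parentheses merely make the operator precedence explicit.

```c
#include <stdint.h>

/* Reverse bits 0..26 of x with three ternary swaps.
   Bits 27..31 of the result are cleared. */
uint32_t rev27(uint32_t x) {
    /* Swap the outer 9-bit groups; the middle group stays put. */
    x = (x & 0x000001FFu) << 18 | (x & 0x0003FE00u) |
        ((x >> 18) & 0x000001FFu);
    /* Within each 9-bit group, swap the outer 3-bit groups. */
    x = (x & 0x001C0E07u) << 6 | (x & 0x00E07038u) |
        ((x >> 6) & 0x001C0E07u);
    /* Within each 3-bit group, swap the outer bits. */
    x = (x & 0x01249249u) << 2 | (x & 0x02492492u) |
        ((x >> 2) & 0x01249249u);
    return x;
}
```

Bit 13, the center of the 27-bit field, is the one position that no stage moves.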


The next function, by Donald E. Knuth [Knu8], is interesting 
because it reverses a 32-bit word with only four stages, and the 
shifting and masking steps are unexpectedly irregular. It uses 
one rotate shift and three ternary swaps. It works as follows: 


01234567 89abcdef ghijklmn opqrstuv   Given
fghijklm nopqrstu v0123456 789abcde   Rotate left 15
pqrstuvm nofghijk labcde56 78901234   10-swap
tuvspqrm nojklifg hebcda96 78541230   4-swap
vutsrqpo nmlkjihg fedcba98 76543210   2-swap


Straightforward code is shown below. 


x = shlr(x, 15);                    // Rotate left 15.
x = (x & 0x003F801F) << 10 | (x & 0x01C003E0) |
    (x >> 10) & 0x003F801F;
x = (x & 0x0E038421) << 4 | (x & 0x11C439CE) |
    (x >> 4) & 0x0E038421;
x = (x & 0x22488842) << 2 | (x & 0x549556B5) |
    (x >> 2) & 0x22488842;


An improvement in operation count, at the expense of 
parallelism, results from rewriting 


x = (x & M1) << s | (x & M2) | (x >> s) & M1;

where M2 is ~(M1 | (M1 << s)), as:

t = (x ^ (x >> s)) & M1; x = (t | (t << s)) ^ x;


This results in the code in Figure 7-3 (19 full RISC instructions, 
plus six to load constants, or 25 in all). 


unsigned rev(unsigned x) {
   unsigned t;

   x = shlr(x, 15);              // Rotate left 15.
   t = (x ^ (x >> 10)) & 0x003F801F; x = (t | (t << 10)) ^ x;
   t = (x ^ (x >>  4)) & 0x0E038421; x = (t | (t <<  4)) ^ x;
   t = (x ^ (x >>  2)) & 0x22488842; x = (t | (t <<  2)) ^ x;
   return x;
}


FIGURE 7-3. Reversing bits, Knuth's algorithm. 


Although Knuth's algorithm does not beat the algorithm
shown in Figure 7-2 for reversing a 32-bit quantity with rotate
shifts allowed (17 instructions, plus eight to load constants),
Knuth's code uses only one rotate shift instruction. If it is coded
as

x = (x << 15) | (x >> 17);   // Rotate left 15.

then Knuth's algorithm is 21 instructions, plus six to load
constants, which is the best found by these measures for reversing
a 32-bit word using only basic RISC instructions. This makes one
wonder if there is a simple way to predict the number of shifts
and logical operations required to reverse a word of a given
length.
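
A self-contained form of Knuth's algorithm, with the rotate coded as two shifts, might look like the following sketch. The name `rev_knuth` and the `uint32_t` type are assumptions of this sketch; the masks and shift amounts are those of Figure 7-3.

```c
#include <stdint.h>

/* Reverse a 32-bit word: one rotate followed by three ternary swaps. */
uint32_t rev_knuth(uint32_t x) {
    uint32_t t;

    x = (x << 15) | (x >> 17);                     /* rotate left 15 */
    t = (x ^ (x >> 10)) & 0x003F801Fu; x = (t | (t << 10)) ^ x;
    t = (x ^ (x >> 4))  & 0x0E038421u; x = (t | (t << 4))  ^ x;
    t = (x ^ (x >> 2))  & 0x22488842u; x = (t | (t << 2))  ^ x;
    return x;
}
```

Each `t` holds the bits in which the two fields being swapped differ, so exclusive-oring `t` into both field positions exchanges them.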


Can Knuth's algorithm be extended to reversing 64 bits on a 
64-bit machine? Yes, there is a simple way and a way that is 
more difficult to work out. The simple way is to first swap the 
two halves of the 64-bit register, and then apply the 32-bit 
version of Knuth’s algorithm to both halves, in parallel. The 
resulting code is shown in Figure 7-4. It is 24 operations, if the
swap (rotate 32) counts as one. 


unsigned long long rev(unsigned long long x) {
   unsigned long long t;

   x = (x << 32) | (x >> 32);             // Swap register halves.
   x = (x & 0x0001FFFF0001FFFFLL) << 15 | // Rotate left
       (x & 0xFFFE0000FFFE0000LL) >> 17;  // 15.
   t = (x ^ (x >> 10)) & 0x003F801F003F801FLL;
   x = (t | (t << 10)) ^ x;
   t = (x ^ (x >> 4)) & 0x0E0384210E038421LL;
   x = (t | (t << 4)) ^ x;
   t = (x ^ (x >> 2)) & 0x2248884222488842LL;
   x = (t | (t << 2)) ^ x;
   return x;
}


FIGURE 7-4. Knuth's algorithm applied to 64 bits. 
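
The function of Figure 7-4 can be exercised in isolation as below. This is a sketch under the assumption that `uint64_t` is available; the name `rev64_knuth` is illustrative.

```c
#include <stdint.h>

/* Reverse a 64-bit word: swap halves, then run the 32-bit Knuth
   reversal on both halves at once. */
uint64_t rev64_knuth(uint64_t x) {
    uint64_t t;

    x = (x << 32) | (x >> 32);                   /* swap register halves */
    x = (x & 0x0001FFFF0001FFFFULL) << 15 |      /* rotate each half     */
        (x & 0xFFFE0000FFFE0000ULL) >> 17;       /* left by 15           */
    t = (x ^ (x >> 10)) & 0x003F801F003F801FULL; x = (t | (t << 10)) ^ x;
    t = (x ^ (x >> 4))  & 0x0E0384210E038421ULL; x = (t | (t << 4))  ^ x;
    t = (x ^ (x >> 2))  & 0x2248884222488842ULL; x = (t | (t << 2))  ^ x;
    return x;
}
```

The masks keep the two halves independent: any bit that a shift carries across the middle of the register is discarded by the following and.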


The other way is to find shift amounts and masks analogous 
to those used in Knuth's 32-bit reversal algorithm. This is shown 
below. It is 25 operations, if the rotate left shift of 31 positions 
counts as one operation. 


unsigned long long rev(unsigned long long x) {
   unsigned long long t;

   x = (x << 31) | (x >> 33);   // I.e., shlr(x, 31).
   t = (x ^ (x >> 20)) & 0x00000FFF800007FFLL;
   x = (t | (t << 20)) ^ x;
   t = (x ^ (x >> 8)) & 0x00F8000F80700807LL;
   x = (t | (t << 8)) ^ x;
   t = (x ^ (x >> 4)) & 0x0808708080807008LL;
   x = (t | (t << 4)) ^ x;
   t = (x ^ (x >> 2)) & 0x1111111111111111LL;
   x = (t | (t << 2)) ^ x;
   return x;
}


Bit reversal can be aided by table lookup. The code that
follows reverses a byte at a time, using a 256-byte table, and
accumulates in reverse order the four bytes selected from the
table. If the loop is strung out, this amounts to 13 basic RISC
instructions, plus four loads, so it could be a winner on some
machines.


unsigned rev(unsigned x) {
   static unsigned char table[256] = {0x00, 0x80, 0x40,
      0xC0, 0x20, 0xA0, 0x60, 0xE0, ..., 0xBF, 0x7F, 0xFF};
   int i;
   unsigned r;

   r = 0;
   for (i = 3; i >= 0; i--) {
      r = (r << 8) + table[x & 0xFF];
      x = x >> 8;
   }
   return r;
}


Generalized Bit Reversal


[GLS1] suggests that the following sort of generalization of bit 
reversal, which he calls “flip,” is a good candidate to consider 
for a computer's instruction set: 


if (k &  1) x = (x & 0x55555555) <<  1 | (x & 0xAAAAAAAA) >>  1;
if (k &  2) x = (x & 0x33333333) <<  2 | (x & 0xCCCCCCCC) >>  2;
if (k &  4) x = (x & 0x0F0F0F0F) <<  4 | (x & 0xF0F0F0F0) >>  4;
if (k &  8) x = (x & 0x00FF00FF) <<  8 | (x & 0xFF00FF00) >>  8;
if (k & 16) x = (x & 0x0000FFFF) << 16 | (x & 0xFFFF0000) >> 16;


(The last two and operations can be omitted.) For k = 31, this
operation reverses the bits in a word. For k = 24, it reverses the
bytes in a word. For k = 7, it reverses the bits in each byte,
without changing the positions of the bytes. For k = 16, it
swaps the left and right halfwords of a word, and so on. In
general, it moves the bit at position m to position $m \oplus k$. It can
be implemented in hardware very similarly to the way a rotate
shifter is usually implemented (five stages of MUX's, with each
stage controlled by a bit of the shift amount k).
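
The special cases just listed are easy to check in software. The sketch below wraps the five conditional swaps in a function; the name `flip` follows the text, while the signature and the use of `uint32_t` are assumptions of this sketch.

```c
#include <stdint.h>

/* Generalized bit reversal: moves the bit at position m to
   position m ^ k, for 0 <= k <= 31. */
uint32_t flip(uint32_t x, unsigned k) {
    if (k &  1) x = (x & 0x55555555u) <<  1 | (x & 0xAAAAAAAAu) >>  1;
    if (k &  2) x = (x & 0x33333333u) <<  2 | (x & 0xCCCCCCCCu) >>  2;
    if (k &  4) x = (x & 0x0F0F0F0Fu) <<  4 | (x & 0xF0F0F0F0u) >>  4;
    if (k &  8) x = (x & 0x00FF00FFu) <<  8 | (x & 0xFF00FF00u) >>  8;
    if (k & 16) x = (x & 0x0000FFFFu) << 16 | (x & 0xFFFF0000u) >> 16;
    return x;
}
```

For example, `flip(0x12345678, 24)` gives 0x78563412 (byte reversal) and `flip(0x12345678, 7)` gives 0x482C6A1E (each byte reversed in place).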


Bit-Reversing Novelties 


Item 167 in [HAK] contains rather esoteric expressions for 
reversing 6-, 7-, and 8-bit integers. Although these expressions 
are designed for a 36-bit machine, the one for reversing a 6-bit 
integer works on a 32-bit machine, and those for 7- and 8-bit 
integers work on a 64-bit machine. These expressions are as 
follows: 


6-bit: remu((x * 0x00082082) & 0x01122408, 255)
7-bit: remu((x * 0x40100401) & 0x442211008, 255)
8-bit: remu((x * 0x202020202) & 0x10884422010, 1023)


The result of all these is a "clean" integer: right-adjusted with
no unused high-order bits set. 
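
These three expressions translate directly into C. In the sketch below the function names are hypothetical, and the 7- and 8-bit versions use `uint64_t` because their products and masks exceed 32 bits.

```c
#include <stdint.h>

/* 6-bit reversal; fits in 32-bit arithmetic. */
uint32_t rev6_hak(uint32_t x) {
    return ((x * 0x00082082u) & 0x01122408u) % 255u;
}

/* 7-bit reversal; needs a 64-bit product. */
uint32_t rev7_hak(uint64_t x) {
    return (uint32_t)(((x * 0x40100401ULL) & 0x442211008ULL) % 255u);
}

/* 8-bit reversal; needs a 64-bit product. */
uint32_t rev8_hak(uint64_t x) {
    return (uint32_t)(((x * 0x202020202ULL) & 0x10884422010ULL) % 1023u);
}
```

In each case the multiplication makes several non-overlapping copies of x, the mask picks one copy of each bit, and the remainder adds up the selected bits at their reversed weights.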


In all these cases the remu function can instead be rem or 
mod, because its arguments are positive. The remainder function 
is simply summing the digits of a base-256 or base-1024 
number, much like casting out nines. Hence, it can be replaced 
with a multiply and a shift right. For example, the 6-bit formula 
has the following alternative on a 32-bit machine (the 
multiplication must be modulo $2^{32}$):

$$t \leftarrow (x * \texttt{0x00082082}) \mathbin{\&} \texttt{0x01122408}$$

$$(t * \texttt{0x01010101}) \stackrel{u}{\gg} 24$$


These formulas are limited in their utility, because they 
involve a remaindering operation (20 cycles or more) and/or 
some multiplications, as well as loading of large constants. The 
formula immediately above requires ten basic RISC instructions, 
two of which are multiply's, which amounts to about 20 cycles 


on a present-day RISC. On the other hand, an adaptation of the 
code of Figure 7-1 to reverse 6-bit integers requires about 15 
instructions, and probably about 9 to 15 cycles, depending on 
the amount of instruction-level parallelism in the machine. 
These techniques, however, do give compact code. Below are a 
few more techniques that might possibly be useful, all for a 32- 
bit machine. They involve a sort of double application of the 
idea from [HAK], to extend the technique to 8- and 9-bit 
integers on a 32-bit machine. 


The following is a formula for reversing an 8-bit integer:

$$s \leftarrow (x * \texttt{0x02020202}) \mathbin{\&} \texttt{0x84422010}$$

$$t \leftarrow (x * 8) \mathbin{\&} \texttt{0x00000420}$$

$$\mathrm{remu}(s + t,\ 1023)$$


Here the remu cannot be changed to a multiply and shift. (You 
have to work these out, and look at the bit patterns, to see why.) 

Here is a similar formula for reversing an 8-bit integer, which 
is interesting because it can be simplified quite a bit: 


$$s \leftarrow (x * \texttt{0x00020202}) \mathbin{\&} \texttt{0x01044010}$$

$$t \leftarrow (x * \texttt{0x00080808}) \mathbin{\&} \texttt{0x02088020}$$

$$\mathrm{remu}(s + t,\ 4095)$$


The simplifications are that the second product is just a shift
left of the first product, the last mask can be generated from the
second with just one instruction (shift), and the remainder can be
replaced by a multiply and shift. It simplifies to 14 basic RISC
instructions, two of which are multiply's:

$$u \leftarrow x * \texttt{0x00020202}$$

$$m \leftarrow \texttt{0x01044010}$$

$$s \leftarrow u \mathbin{\&} m$$

$$t \leftarrow (u \ll 2) \mathbin{\&} (m \ll 1)$$

$$(\texttt{0x01001001} * (s + t)) \stackrel{u}{\gg} 24$$


The following is a formula for reversing a 9-bit integer:

$$s \leftarrow (x * \texttt{0x01001001}) \mathbin{\&} \texttt{0x84108010}$$

$$t \leftarrow (x * \texttt{0x00040040}) \mathbin{\&} \texttt{0x00841080}$$

$$\mathrm{remu}(s + t,\ 1023)$$


The second multiplication can be avoided, because the 
product is equal to the first product shifted right six positions. 
The last mask is equal to the second mask shifted right eight 
positions. With these simplifications, this requires 12 basic RISC 
instructions, including the one multiply and one remainder. The 
remainder operation must be unsigned, and it cannot be changed 
to a multiply and shift. 
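
The 32-bit formulas of this section can be tried out as follows. This is a sketch: the function names are hypothetical, all arithmetic is modulo $2^{32}$ through `uint32_t`, and `%` on unsigned operands plays the role of remu.

```c
#include <stdint.h>

/* First 8-bit formula: two products and an unsigned remainder. */
uint32_t rev8_rem(uint32_t x) {
    uint32_t s = (x * 0x02020202u) & 0x84422010u;
    uint32_t t = (x * 8u) & 0x00000420u;
    return (s + t) % 1023u;
}

/* Simplified second 8-bit formula: one product reused, and the
   remainder replaced by a multiply and shift. */
uint32_t rev8_mul(uint32_t x) {
    uint32_t u = x * 0x00020202u;
    uint32_t m = 0x01044010u;
    uint32_t s = u & m;
    uint32_t t = (u << 2) & (m << 1);
    return (0x01001001u * (s + t)) >> 24;
}

/* 9-bit formula. */
uint32_t rev9_rem(uint32_t x) {
    uint32_t s = (x * 0x01001001u) & 0x84108010u;
    uint32_t t = (x * 0x00040040u) & 0x00841080u;
    return (s + t) % 1023u;
}
```

The arguments are assumed to be already limited to 8 or 9 bits; wider inputs are not masked here.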


The reader who studies these marvels will be able to devise 
similar code for other bit-permuting operations. As a simple (and 
artificial) example, suppose it is desired to extract every other 
bit from an 8-bit quantity and compress the four bits to the 
right. That is, the desired transformation is 


0000 0000 0000 0000 0000 0000 abcd efgh ==>
0000 0000 0000 0000 0000 0000 0000 bdfh


This can be computed as follows:

$$t \leftarrow (x * \texttt{0x01010101}) \mathbin{\&} \texttt{0x40100401}$$

$$(t * \texttt{0x08040201}) \stackrel{u}{\gg} 27$$


On most machines, the most practical way to do all these
operations is by indexing into a table of 1-byte (or 9-bit)
integers.
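
The every-other-bit extraction described above is short enough to test directly. In this sketch the name `every_other_bit` is hypothetical and x is assumed to be at most 8 bits wide.

```c
#include <stdint.h>

/* Map 0...0 abcd efgh to 0...0 0000 bdfh using two multiplications. */
uint32_t every_other_bit(uint32_t x) {
    /* Four copies of x; keep h, f, d, b at bits 0, 10, 20, 30. */
    uint32_t t = (x * 0x01010101u) & 0x40100401u;
    /* Gather them into bits 27..30, then right-justify. */
    return (t * 0x08040201u) >> 27;
}
```

The second multiplier is chosen so that no two shifted copies of a selected bit land on the same position, which means no carries disturb the result.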


Incrementing a Reversed Integer 


The Fast Fourier Transform (FFT) algorithm employs an integer i 
and its bit reversal rev(i) in a loop in which i is incremented by 
1 [PuBr]. Straightforward coding would increment i and then 
compute rev(i) on each loop iteration. For small integers, 
computing rev(i) by table lookup is fast and practical. For large 
integers, however, table lookup is not practical and, as we have 
seen, computing rev(i) requires some 29 instructions. 


If table lookup cannot be used, it is more efficient to maintain 
i in both normal and bit-reversed forms, incrementing them both 
on each loop iteration. This raises the question of how best to 
increment an integer that is in a register in reversed form. To 
illustrate, on a 4-bit machine we wish to successively step 
through the values (in hexadecimal) 


0, 8, 4, C, 2, A, 6, E, 1, 9, 5, D, 3, B, 7, F.


In the FFT algorithm, i and its reversal are both some specific 
number of bits in length, almost certainly less than 32, and they 
are both right-justified in the register. However, we assume here 
that i is a 32-bit integer. After adding 1 to the reversed 32-bit 
integer, a shift right of the appropriate number of bits will make 
the result usable by the FFT algorithm (both i and rev(i) are 
used to index an array in memory). 


The straightforward way to increment a reversed integer is to
scan from the left for the first 0-bit, set it to 1, and set all bits to
the left of it (if any) to 0's. One way to code this is


unsigned x, m;

m = 0x80000000;
x = x ^ m;
if ((int)x >= 0) {
   do {
      m = m >> 1;
      x = x ^ m;
   } while (x < m);
}


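This executes in three basic RISC instructions if x begins with
a 0-bit, and four additional instructions for each loop iteration.
Because x begins with a 0-bit half the time, with 10 (binary)
one-fourth of the time, and so on, the average number of
instructions executed is approximately

$$
\begin{aligned}
&3\cdot\frac{1}{2} + 7\cdot\frac{1}{4} + 11\cdot\frac{1}{8} + 15\cdot\frac{1}{16} + \cdots \\
&\quad= 4\cdot\frac{1}{2} + 8\cdot\frac{1}{4} + 12\cdot\frac{1}{8} + 16\cdot\frac{1}{16} + \cdots - 1 \\
&\quad= 4\left(\frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{4}{16} + \cdots\right) - 1 \\
&\quad= 7.
\end{aligned}
$$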


In the second line we added and subtracted 1, with the first 1 
in the form 1/2 + 1/4 + 1/8 + 1/16 + .... This makes the 
series similar to the one analyzed on page 113. The number of 
instructions executed in the worst case, however, is quite large 
(131). 
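
The loop method can be wrapped in a function and checked against the 4-bit sequence given earlier. This is a sketch: the name `rev_increment` is hypothetical, and the sign test is written as a mask test so that it does not depend on how a C implementation converts large unsigned values to `int`.

```c
#include <stdint.h>

/* Add 1 to a bit-reversed 32-bit integer (the "carry" runs from
   the left end of the word toward the right). */
uint32_t rev_increment(uint32_t x) {
    uint32_t m = 0x80000000u;

    x ^= m;                          /* flip the leftmost bit        */
    if ((x & 0x80000000u) == 0) {    /* it was 1, so carry rightward */
        do {
            m >>= 1;
            x ^= m;
        } while (x < m);             /* bit became 0: keep carrying  */
    }
    return x;
}
```

Starting from 0 and shifting each result right by 28 gives 8, 4, C, 2, and so on, matching the 4-bit sequence in the text.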


If number of leading zeros is available, adding 1 to a reversed 
integer can be done as follows: 


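First execute: $s \leftarrow \mathrm{nlz}(\lnot x)$

and then either: $x \leftarrow x \oplus (\texttt{0x80000000} \stackrel{s}{\gg} s)$

or: $x \leftarrow ((x \ll s) + \texttt{0x80000000}) \stackrel{u}{\gg} s$

(The shift marked s is a signed, sign-propagating shift right; the
one marked u is unsigned.)


Either method requires five full RISC instructions and, to
properly wrap around from 0xFFFFFFFF to 0, requires that the
shifts be modulo 64. (These formulas fail in this respect on the
Intel x86 machines, because the shifts are modulo 32.)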
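
In portable C, a shift by 32 on a 32-bit operand is undefined, so a software version has to treat the wrap-around case separately. The sketch below makes that explicit. The helper `nlz32` is a plain loop standing in for a number-of-leading-zeros instruction, and all names are hypothetical. The first variant builds the mask of the leftmost s + 1 bits with an unsigned shift and complement rather than a signed shift, which is a substitution made here to stay within well-defined C.

```c
#include <stdint.h>

/* Number of leading zeros, by a simple loop (stand-in for nlz). */
int nlz32(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Exclusive-or form: flip the leading 1-bits and the 0 after them. */
uint32_t rev_inc_xor(uint32_t x) {
    int s = nlz32(~x);               /* count of leading 1-bits      */
    if (s == 32) return 0;           /* 0xFFFFFFFF wraps around to 0 */
    return x ^ ~(0x7FFFFFFFu >> s);  /* mask of the top s + 1 bits   */
}

/* Add form: shift the leading 1-bits out, add at the top, shift back. */
uint32_t rev_inc_add(uint32_t x) {
    int s = nlz32(~x);
    if (s == 32) return 0;
    return ((x << s) + 0x80000000u) >> s;
}
```

Both variants agree with the loop method; they differ only in how the mask of flipped bits is produced.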


The rather puzzling one-liner below [Mobi] increments a 
reversed integer in six basic RISC instructions. It is free of 
branches and loads but includes an integer division operation. It 
works for integers of length up to that of the word size of the 
machine, less 1. 


$$\mathit{revi} \leftarrow \mathit{revi} \oplus \left(m - \frac{m}{(i \oplus (i + 1)) + 1}\right)$$

To use this, both the non-reversed integer i and its reversal
revi must be available. The variable m is the modulus; if we are
dealing with n-bit integers, then $m = 2^n$. Applying the formula
gives the next value of the reversed integer. The non-reversed
integer i would be incremented separately. The reversed integer
is incremented "in place"; that is, it is not shifted to the
high-order end of the register, as in the two preceding methods.


A variation is 


$$\mathit{revi} \leftarrow \mathit{revi} \oplus \left(m - \frac{m / 2}{\lnot i \mathbin{\&} (i + 1)}\right)$$

which executes in five instructions if the machine has and not,
and if m is a constant so that the calculation of m / 2 does not
count. It works for integers of length up to that of the word size
of the machine. (For full word-size integers, use 0 for the first
occurrence of m in the formula, and $2^{n-1}$ for m / 2.)
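
Both in-place formulas can be checked on small cases. The following is a sketch under stated assumptions: the names `rev_next` and `rev_next_andnot` are hypothetical, m is passed as $2^n$ with n less than 32, and the caller increments i separately. The second function follows the variation as reconstructed above.

```c
#include <stdint.h>

/* Next reversed value from the current reversed value revi, the
   current non-reversed value i, and the modulus m = 2^n. */
uint32_t rev_next(uint32_t revi, uint32_t i, uint32_t m) {
    /* (i ^ (i + 1)) + 1 is 2^(t+1), where t counts trailing 1s of i. */
    return revi ^ (m - m / ((i ^ (i + 1)) + 1));
}

/* Variation using and-not to isolate the rightmost 0-bit of i. */
uint32_t rev_next_andnot(uint32_t revi, uint32_t i, uint32_t m) {
    return revi ^ (m - (m / 2) / (~i & (i + 1)));
}
```

With m = 16, stepping i from 0 upward produces revi = 8, 4, C, 2, and so on. Neither function as written wraps the final step from all ones back to zero within an n-bit field.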


7-2 Shuffling Bits 


Another important permutation of the bits of a word is the
“perfect shuffle” operation, which has applications in
cryptography. There are two varieties, called the “outer” and 
“inner” perfect shuffles. They both interleave the bits in the two 
halves of a word in a manner similar to a perfect shuffle of a 
deck of 32 cards, but they differ in which card is allowed to fall 
first. In the outer perfect shuffle, the outer (end) bits remain in 
the outer positions, and in the inner perfect shuffle, bit 15 moves 
to the left end of the word (position 31). If the 32-bit word is 
(where each letter denotes a single bit) 


abcd efgh ijkl mnop ABCD EFGH IJKL MNOP,


then after the outer perfect shuffle it is

aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP,


and after the inner perfect shuffle it is

AaBb CcDd EeFf GgHh IiJj KkLl MmNn OoPp.


Assume the word size W is a power of 2. Then the outer 
perfect shuffle operation can be accomplished with basic RISC 
instructions in $\log_2(W / 2)$ steps, where each step swaps the
second and third quartiles of successively smaller pieces [GLS1]. 
That is, a 32-bit word is transformed as follows: 


abcd efgh ijkl mnop ABCD EFGH IJKL MNOP
abcd efgh ABCD EFGH ijkl mnop IJKL MNOP
abcd ABCD efgh EFGH ijkl IJKL mnop MNOP
abAB cdCD efEF ghGH ijIJ klKL mnMN opOP
aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP


Straightforward code for this is 


x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;


which requires 42 basic RISC instructions. This can be 
reduced to 30 instructions, although at an increase from 17 to 
21 cycles on a machine with unlimited instruction-level 
parallelism, by using the exclusive or method of exchanging two 
fields of a register (described on page 47). All quantities are 
unsigned: 


t = (x ^ (x >> 8)) & 0x0000FF00; x = x ^ t ^ (t << 8);
t = (x ^ (x >> 4)) & 0x00F000F0; x = x ^ t ^ (t << 4);
t = (x ^ (x >> 2)) & 0x0C0C0C0C; x = x ^ t ^ (t << 2);
t = (x ^ (x >> 1)) & 0x22222222; x = x ^ t ^ (t << 1);


The inverse operation, the outer unshuffle, is easily 
accomplished by performing the swaps in reverse order: 


t = (x ^ (x >> 1)) & 0x22222222; x = x ^ t ^ (t << 1);
t = (x ^ (x >> 2)) & 0x0C0C0C0C; x = x ^ t ^ (t << 2);
t = (x ^ (x >> 4)) & 0x00F000F0; x = x ^ t ^ (t << 4);
t = (x ^ (x >> 8)) & 0x0000FF00; x = x ^ t ^ (t << 8);


Using only the last two steps of either of the above two 
shuffle sequences shuffles the bits of each byte separately. Using 
only the last three steps shuffles the bits of each halfword 
separately, and so on. Similar remarks apply to unshuffling, 
except by using the first two or three steps. 
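
The outer shuffle and unshuffle sequences can be wrapped as a pair of functions and checked as inverses of each other. This is a sketch: the names `outer_shuffle` and `outer_unshuffle` and the `uint32_t` type are assumptions.

```c
#include <stdint.h>

/* Outer perfect shuffle: interleave the two halves of x, with the
   left half landing in the odd-numbered bit positions. */
uint32_t outer_shuffle(uint32_t x) {
    uint32_t t;
    t = (x ^ (x >> 8)) & 0x0000FF00u; x = x ^ t ^ (t << 8);
    t = (x ^ (x >> 4)) & 0x00F000F0u; x = x ^ t ^ (t << 4);
    t = (x ^ (x >> 2)) & 0x0C0C0C0Cu; x = x ^ t ^ (t << 2);
    t = (x ^ (x >> 1)) & 0x22222222u; x = x ^ t ^ (t << 1);
    return x;
}

/* Outer unshuffle: the same swaps applied in the opposite order. */
uint32_t outer_unshuffle(uint32_t x) {
    uint32_t t;
    t = (x ^ (x >> 1)) & 0x22222222u; x = x ^ t ^ (t << 1);
    t = (x ^ (x >> 2)) & 0x0C0C0C0Cu; x = x ^ t ^ (t << 2);
    t = (x ^ (x >> 4)) & 0x00F000F0u; x = x ^ t ^ (t << 4);
    t = (x ^ (x >> 8)) & 0x0000FF00u; x = x ^ t ^ (t << 8);
    return x;
}
```

For instance, `outer_shuffle(0x0000FFFF)` is 0x55555555 and `outer_shuffle(0xFFFF0000)` is 0xAAAAAAAA.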


To get the inner perfect shuffle, prepend to these sequences a 
step to swap the left and right halves of the register: 

x = (x >> 16) | (x<< 16); 
(or use a rotate of 16 bit positions). The unshuffle sequence can 
be similarly modified by appending this line of code. 


Altering the transformation to swap the first and fourth 
quartiles of successively smaller pieces produces the bit reversal 
of the inner perfect shuffle. 


Perhaps worth mentioning is the special case in which the left 
half of the word x is all 0. In other words, we want to move the 
bits in the right half of x to every other bit position—that is, to 
transform the 32-bit word 


0000 0000 0000 0000 ABCD EFGH IJKL MNOP


to

0A0B 0C0D 0E0F 0G0H 0I0J 0K0L 0M0N 0O0P.


The outer perfect shuffle code can be simplified to do this
task in 22 basic RISC instructions. The code below, however,
does it in only 19, at no cost in execution time on a machine
with unlimited instruction-level parallelism (12 cycles with
either method). This code does not require that the left half of
word x be initially cleared.


x = ((x & 0xFF00) << 8) | (x & 0x00FF);
x = ((x << 4) | x) & 0x0F0F0F0F;
x = ((x << 2) | x) & 0x33333333;
x = ((x << 1) | x) & 0x55555555;


Similarly, for the inverse of this “half shuffle" operation (a 
special case of compress; see page 150), the outer perfect 
unshuffle code can be simplified to do the task in 26 or 29 basic 
RISC instructions, depending on whether or not an initial and 
operation is required to clear the bits in the odd positions. The 
code below, however, does it in only 18 or 21 basic RISC 
instructions, and with less execution time on a machine with 
unlimited instruction-level parallelism (12 or 15 cycles). 


x = x & 0x55555555;                 // (If required.)
x = ((x >> 1) | x) & 0x33333333;
x = ((x >> 2) | x) & 0x0F0F0F0F;
x = ((x >> 4) | x) & 0x00FF00FF;
x = ((x >> 8) | x) & 0x0000FFFF;
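
The half shuffle and its inverse can be tested together. In this sketch the names `half_shuffle` and `half_unshuffle` are hypothetical, and the inverse always performs the initial and, so that it accepts arbitrary input.

```c
#include <stdint.h>

/* Spread the right half of x into the even-numbered bit positions. */
uint32_t half_shuffle(uint32_t x) {
    x = ((x & 0xFF00u) << 8) | (x & 0x00FFu);
    x = ((x << 4) | x) & 0x0F0F0F0Fu;
    x = ((x << 2) | x) & 0x33333333u;
    x = ((x << 1) | x) & 0x55555555u;
    return x;
}

/* Inverse: gather the even-numbered bits into the right half. */
uint32_t half_unshuffle(uint32_t x) {
    x = x & 0x55555555u;               /* clear the odd positions */
    x = ((x >> 1) | x) & 0x33333333u;
    x = ((x >> 2) | x) & 0x0F0F0F0Fu;
    x = ((x >> 4) | x) & 0x00FF00FFu;
    x = ((x >> 8) | x) & 0x0000FFFFu;
    return x;
}
```

Because the first step of `half_shuffle` masks off everything but the low 16 bits, any contents of the left half of x are simply ignored.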


7-3 Transposing a Bit Matrix 


The transpose of a matrix A is a matrix whose columns are the 
rows of A and whose rows are the columns of A. Here we 
consider the problem of computing the transpose of a bit matrix 
whose elements are single bits that are packed eight per byte, 
with rows and columns beginning on byte boundaries. This 
seemingly simple transformation is surprisingly costly in 
instructions executed. 


On most machines it would be very slow to load and store
individual bits, mainly due to the code that would be required to
extract and (worse yet) to store individual bits. A better method
is to partition the matrix into 8 x 8 submatrices, load each 8 x 8
submatrix into registers, compute the transpose of the submatrix
in registers, and then store the 8 x 8 result in the appropriate
place in the target matrix. Figure 7-5 illustrates the
transposition of a bit matrix of size 2 x 3 bytes. A, B, ..., F are
submatrices of size 8 x 8 bits. A^T, B^T, ... denote the transpose of
submatrices A, B, ....


m=2 


FIGURE 7-5. Transposing a 16 x 24-bit matrix. 


For the purposes of transposing an 8 x 8 submatrix, it doesn't 
matter whether the bit matrix is stored in row-major or column- 
major order; the operations are the same in either event. Assume 
for discussion that it’s in row-major order. Then the first byte of 
the matrix contains the top row of A, the next byte contains the 
top row of B, and so on. If L denotes the address of the first byte 
(top row) of a submatrix, then successive rows of the submatrix 
are at locations L + n, L + 2n, ..., L + 7n. 


For this problem we will depart from the usual assumption of 
a 32-bit machine and assume the machine has 64-bit general 
registers. The algorithms are simpler and more easily understood 
in this way, and it is not difficult to convert them for execution 
on a 32-bit machine. In fact, a compiler that supports 64-bit 
integer operations on a 32-bit machine will do the work for you 
(although probably not as effectively as you can do by hand). 


The overall scheme is to load a submatrix with eight load byte 
instructions and pack the bytes left-to-right into a 64-bit register. 
Then the transpose of the register’s contents is computed. 
Finally, the result is stored in the target area with eight store byte 
instructions. 


The transposition of an 8 x 8 bit matrix is illustrated here,
where each character represents a single bit.


0123 4567      08go wEMU
89ab cdef      19hp xFNV
ghij klmn      2aiq yGOW
opqr stuv  =>  3bjr zHPX
wxyz ABCD      4cks AIQY
EFGH IJKL      5dlt BJRZ
MNOP QRST      6emu CKS$
UVWX YZ$.      7fnv DLT.


In terms of doublewords, the transformation to be done is to 
change the first line to the second line below. 


01234567 89abcdef ghijklmn opqrstuv wxyzABCD EFGHIJKL MNOPQRST UVWXYZ$.
08gowEMU 19hpxFNV 2aiqyGOW 3bjrzHPX 4cksAIQY 5dltBJRZ 6emuCKS$ 7fnvDLT.


Notice that the bit denoted by 1 moves seven positions to the 
right, the bit denoted by 2 moves 14 positions to the right, and 
the bit denoted by 8 moves seven positions to the left. Every bit 
moves 0, 7, 14, 21, 28, 35, 42, or 49 positions to the left or
right. Since there are 56 bits in the doubleword that have to be 
moved and only 14 different nonzero movement amounts, an 
average of about four bits can be moved at once, with 
appropriate masking and shifting. Straightforward code for this 
follows. 


y = x & 0x8040201008040201LL |
    (x & 0x0080402010080402LL) <<  7 |
    (x & 0x0000804020100804LL) << 14 |
    (x & 0x0000008040201008LL) << 21 |
    (x & 0x0000000080402010LL) << 28 |
    (x & 0x0000000000804020LL) << 35 |
    (x & 0x0000000000008040LL) << 42 |
    (x & 0x0000000000000080LL) << 49 |
    (x >>  7) & 0x0080402010080402LL |
    (x >> 14) & 0x0000804020100804LL |
    (x >> 21) & 0x0000008040201008LL |
    (x >> 28) & 0x0000000080402010LL |
    (x >> 35) & 0x0000000000804020LL |
    (x >> 42) & 0x0000000000008040LL |
    (x >> 49) & 0x0000000000000080LL;


This executes in 43 instructions on the basic RISC, exclusive 
of mask generation (which is not important in the application of 
transposing a large bit matrix, because the masks are loop 
constants). Rotate shifts do not help. Some of the terms are of
the form (x & mask) << s, and some are of the form (x >> s) &
mask. This reduces the number of masks required; the last seven
are repeats of earlier masks. Notice that each mask after the first 
can be generated from the first with one shift right instruction. 
Because of this, it is a simple matter to write a more compact 
version of the code that uses a for-loop that is executed seven 
times. 
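
One possible form of the compact for-loop version mentioned above is sketched here. The function name `transpose8x8_loop` is hypothetical, and the 8 x 8 matrix is assumed to be packed with the top row in the most significant byte, as in the illustration.

```c
#include <stdint.h>

/* Transpose an 8 x 8 bit matrix held in a 64-bit word, moving one
   diagonal pair per loop iteration. */
uint64_t transpose8x8_loop(uint64_t x) {
    uint64_t mask = 0x8040201008040201ULL;   /* main diagonal stays put */
    uint64_t y = x & mask;
    int s;

    for (s = 7; s <= 49; s += 7) {
        mask >>= 8;                           /* next diagonal mask */
        y |= ((x & mask) << s) | ((x >> s) & mask);
    }
    return y;
}
```

Each iteration handles the bits that move s positions left together with those that move s positions right, using the one mask for both.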


Another variation is to employ Steele's method of using 
exclusive or to swap bit fields (described on page 47). That 
technique does not help much in this application. It results in a 
function that executes in 42 instructions, exclusive of mask 
generation. The code starts out 


t = (x ^ (x >> 7)) & 0x0080402010080402LL;
x = x ^ t ^ (t << 7);


and there are seven such pairs of lines. 


Although there does not seem to be a really great algorithm 
for this problem, the method to be described beats the 
straightforward method and its variations described above by 
approximately a factor of 2 on the basic RISC, for the calculation 
part (not counting loading and storing the submatrices or 
generating masks). The method gets its power from its high level 
of bit-parallelism. It would not be a good method if the matrix 
elements are words. For that, you can't do better than loading 
each word and storing it where it goes. 


First, treat the 8 x 8-bit matrix as 16 2 x 2-bit matrices and
transpose each of the 16 2 x 2-bit matrices. Then treat the matrix
as four 2 x 2 submatrices whose elements are 2 x 2-bit matrices
and transpose each of the four 2 x 2 submatrices. Finally, treat
the matrix as a 2 x 2 matrix whose elements are 4 x 4-bit
matrices and transpose the 2 x 2 matrix. These transformations
are illustrated below [Floyd].


0123 4567      082a 4c6e      08go 4cks      08go wEMU
89ab cdef      193b 5d7f      19hp 5dlt      19hp xFNV
ghij klmn      goiq ksmu      2aiq 6emu      2aiq yGOW
opqr stuv  =>  hpjr ltnv  =>  3bjr 7fnv  =>  3bjr zHPX
wxyz ABCD      wEyG AICK      wEMU AIQY      4cks AIQY
EFGH IJKL      xFzH BJDL      xFNV BJRZ      5dlt BJRZ
MNOP QRST      MUOW QYS$      yGOW CKS$      6emu CKS$
UVWX YZ$.      NVPX RZT.      zHPX DLT.      7fnv DLT.


A complete procedure is shown in Figure 7-6. Parameter A is
the address of the first byte of an 8 x 8 submatrix of the source
matrix, and parameter B is the address of the first byte of an
8 x 8 submatrix in the target matrix.


The calculation part of this function executes in 21 
instructions. Each of the three major steps is swapping bits, so a 
version can be written that uses the Steele exclusive or bit field 
swapping device. Using it, the first assignment to x in Figure 7-6 
becomes: 


t = (x ^ (x >> 7)) & 0x00AA00AA00AA00AALL;
x = x ^ t ^ (t << 7);


The calculation part of the revised function executes in only 18 
instructions, but it has no instruction-level parallelism. 
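
Written out in full, the exclusive-or version of the calculation might look like the sketch below. The function names are hypothetical; the second and third masks are taken from the corresponding steps of Figure 7-6. A variant with the three swaps in the opposite order is included for comparison, since each swap exchanges a different pair of row and column index bits.

```c
#include <stdint.h>

/* 8 x 8 bit-matrix transpose, fine to coarse: 2 x 2 blocks of bits,
   then 2 x 2 blocks of 2 x 2s, then 2 x 2 blocks of 4 x 4s. */
uint64_t transpose8x8_xor(uint64_t x) {
    uint64_t t;
    t = (x ^ (x >> 7))  & 0x00AA00AA00AA00AAULL; x = x ^ t ^ (t << 7);
    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCULL; x = x ^ t ^ (t << 14);
    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0ULL; x = x ^ t ^ (t << 28);
    return x;
}

/* The same three swaps, run from coarse to fine granularity. */
uint64_t transpose8x8_xor_coarse(uint64_t x) {
    uint64_t t;
    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0ULL; x = x ^ t ^ (t << 28);
    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCULL; x = x ^ t ^ (t << 14);
    t = (x ^ (x >> 7))  & 0x00AA00AA00AA00AAULL; x = x ^ t ^ (t << 7);
    return x;
}
```

Each statement pair depends on the result of the one before it, which is where the loss of instruction-level parallelism comes from.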


The algorithm of Figure 7-6 runs from fine to coarse 
granularity, based on the lengths of the groups of bits that are 
swapped. The method can also be run from coarse to fine 
granularity. To do this, first treat the 8 x 8-bit matrix as a 2 x 2
matrix whose elements are 4 x 4-bit matrices and transpose the
2 x 2 matrix. Then, treat each of the four 4 x 4 submatrices as a
2 x 2 matrix whose elements are 2 x 2-bit matrices, and
transpose each of the four 2 x 2 submatrices, and so forth. The
code for this is the same as that of Figure 7-6 except for the
three assignments that do the bit rearranging being run in
reverse order.


void transpose8(unsigned char A[8], int m, int n,
                unsigned char B[8]) {
   unsigned long long x;
   int i;

   for (i = 0; i <= 7; i++)     // Load 8 bytes from the
      x = x << 8 | A[m*i];      // input array and pack
                                // them into x.

   x = x & 0xAA55AA55AA55AA55LL |
      (x & 0x00AA00AA00AA00AALL) << 7 |
      (x >> 7) & 0x00AA00AA00AA00AALL;
   x = x & 0xCCCC3333CCCC3333LL |
      (x & 0x0000CCCC0000CCCCLL) << 14 |
      (x >> 14) & 0x0000CCCC0000CCCCLL;
   x = x & 0xF0F0F0F00F0F0F0FLL |
      (x & 0x00000000F0F0F0F0LL) << 28 |
      (x >> 28) & 0x00000000F0F0F0F0LL;

   for (i = 7; i >= 0; i--) {   // Store result into
      B[n*i] = x; x = x >> 8;}  // output array B.
}


FIGURE 7-6. Transposing an 8 x 8-bit matrix. 


As was mentioned, these functions can be modified for 
execution on a 32-bit machine by using two registers for each 
64-bit quantity. If this is done and any calculations that would 
result in zero are used to make obvious simplifications, the 
results are that a 32-bit version of the straightforward method 
described on page 143 runs in 74 instructions (compared to 43 
on a 64-bit machine), and a 32-bit version of the function of 
Figure 7-6 runs in 36 instructions (compared to 21 on a 64-bit 
machine). Using Steele’s bit-swapping technique gives a 
reduction in instructions executed at the expense of instruction- 
level parallelism, as in the case of a 64-bit machine. 


Transposing a 32 x 32-Bit Matrix 


The same recursive technique that was used for the 8 x 8-bit
matrix can be used for larger matrices. For a 32 x 32-bit matrix
it takes five stages.


The details are quite different from Figure 7-6, because here
we assume that the entire 32 x 32-bit matrix does not fit in the
general register space, and we seek a compact procedure that 
indexes the appropriate words of the bit matrix to do the bit 
swaps. The algorithm to be described works best if run from 
coarse to fine granularity. 


In the first stage, treat the matrix as four 16 x 16-bit matrices,
and transform it as follows:

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix} \Rightarrow \begin{bmatrix} A & C \\ B & D \end{bmatrix}.$$


A denotes the left half of the first 16 words of the matrix, B 
denotes the right half of the first 16 words, and so on. It should 
be clear that the above transformation can be accomplished by 
the following swaps: 


Right half of word 0 with the left half of word 16,
Right half of word 1 with the left half of word 17,
...
Right half of word 15 with the left half of word 31.


To implement this in code, we will have an index k that ranges
from 0 to 15. In a loop controlled by k, the right half of word k
will be swapped with the left half of word k + 16.


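In the second stage, treat the matrix as 16 8 x 8-bit matrices,
and transform it as follows:

$$\begin{bmatrix} A & B & C & D \\ E & F & G & H \\ I & J & K & L \\ M & N & O & P \end{bmatrix} \Rightarrow \begin{bmatrix} A & E & C & G \\ B & F & D & H \\ I & M & K & O \\ J & N & L & P \end{bmatrix}.$$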


This transformation can be accomplished by the following 
swaps: 


Bits 0x00FF00FF of word 0 with bits 0xFF00FF00 of word 8,
Bits 0x00FF00FF of word 1 with bits 0xFF00FF00 of word 9,
and so on.


This means that bits 0-7 (the least significant eight bits) of word
0 are swapped with bits 8-15 of word 8, and so on. The indexes
of the first word in these swaps are k = 0, 1, 2, 3, 4, 5, 6, 7, 16,
17, 18, 19, 20, 21, 22, 23. A way to step k through these values
is

$$k' = (k + 9) \mathbin{\&} \lnot 8.$$

In the loop controlled by k, bits of word k are swapped with bits
of word k + 8.


Similarly, the third stage does the following swaps: 


   Bits 0x0F0F0F0F of word 0 with bits 0xF0F0F0F0 of word 4,
   Bits 0x0F0F0F0F of word 1 with bits 0xF0F0F0F0 of word 5,
   and so on.


The indexes of the first word in these swaps are k = 0, 1, 2, 3,
8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27. A way to step k
through these values is


   k' = (k + 5) & ~4.


In the loop controlled by k, bits of word k are swapped with bits
of word k + 4.
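

The two stepping formulas above are instances of one pattern: for a
stage whose paired words are j apart (j a power of 2), the next index
is (k + j + 1) & ~j. The following is a minimal sketch of that rule;
the helper names next_k and stage_indexes are hypothetical and are
used only for illustration.

```c
/* Next first-word index for a stage whose paired words are j apart
   (j a power of 2, k with bit j clear). Adding j sets bit j, so a
   carry out of the low bits propagates past it; the AND then clears
   bit j again. */
static int next_k(int k, int j) {
    return (k + j + 1) & ~j;
}

/* Collect the first-word indexes of one stage of a 32-word matrix
   into out[] (capacity 16) and return how many there are. */
static int stage_indexes(int j, int out[16]) {
    int n = 0;
    for (int k = 0; k < 32; k = next_k(k, j))
        out[n++] = k;
    return n;
}
```

With j = 8 this yields 0..7 and 16..23, and with j = 4 it yields
0..3, 8..11, 16..19, and 24..27, matching the lists in the text.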


These considerations are coded rather compactly in the C
function shown in Figure 7-7 [GLS1]. The outer loop controls
the five stages, with j taking on the values 16, 8, 4, 2, and 1. It
also steps the mask m through the values 0x0000FFFF,
0x00FF00FF, 0x0F0F0F0F, 0x33333333, and 0x55555555. (The
code for this, m = m ^ (m << j), is a nice little trick. It does not
have an inverse, which is the main reason this code works best
for coarse to fine transformations.) The inner loop steps k
through the values described above. The inner loop body swaps
the bits of A[k] identified by mask m with the bits of A[k+j]
shifted right j and identified by m, which is equivalent to the
bits of A[k+j] identified with the complement of m. The code for
performing these swaps is an adaptation of the "three exclusive
or" technique shown on page 46 column (c).
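

To see the mask trick in isolation, the sketch below steps j and m
exactly as the outer loop does and records the mask used at each
stage. The function name stage_masks is hypothetical, and uint32_t is
used on the assumption that the words are 32 bits wide.

```c
#include <stdint.h>

/* Fill masks[0..4] with the masks used for j = 16, 8, 4, 2, 1.
   After j is halved, m ^ (m << j) splits every run of 1's in m
   into two runs of half the length. */
static void stage_masks(uint32_t masks[5]) {
    uint32_t m = 0x0000FFFFu;
    int i = 0;
    for (int j = 16; j != 0; j >>= 1, m ^= m << j)
        masks[i++] = m;
}
```

For example, 0x0000FFFF ^ (0x0000FFFF << 8) = 0x0000FFFF ^ 0x00FFFF00
= 0x00FF00FF, which is the second mask.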


void transpose32(unsigned A[32]) {
   int j, k;
   unsigned m, t;

   m = 0x0000FFFF;
   for (j = 16; j != 0; j = j >> 1, m = m ^ (m << j)) {
      for (k = 0; k < 32; k = (k + j + 1) & ~j) {
         t = (A[k] ^ (A[k+j] >> j)) & m;
         A[k] = A[k] ^ t;
         A[k+j] = A[k+j] ^ (t << j);
      }
   }
}


FIGURE 7-7. Compact code for transposing a 32 x 32-bit 
matrix. 
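

When experimenting with code such as Figure 7-7, a slow but obvious
reference is useful for comparison. The following bit-at-a-time
sketch is not from the figure; the name transpose32_ref is
hypothetical, and it assumes the same layout as the text (row i is
word i, and column 0 is the most significant bit).

```c
#include <stdint.h>

/* Bit-at-a-time reference transpose. Row i is A[i]; column 0 is the
   most significant (leftmost) bit. The result is written to B, which
   must not overlap A. */
static void transpose32_ref(const uint32_t A[32], uint32_t B[32]) {
    for (int i = 0; i < 32; i++) {
        uint32_t row = 0;
        for (int j = 0; j < 32; j++) {
            uint32_t bit = (A[j] >> (31 - i)) & 1u; /* element (j, i) */
            row |= bit << (31 - j);                 /* goes to (i, j) */
        }
        B[i] = row;
    }
}
```

It executes on the order of a thousand loop iterations, so it is only
a checking aid, not a competitor to the methods discussed here.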


Compiled with the GNU C compiler for a machine very similar to
the basic RISC, this function comes to 31 instructions, with 20 in
the inner loop and 7 in the outer loop but not in the inner loop.
Thus, it executes in 4 + 5(7 + 16 · 20) = 1639 instructions. In
contrast, if this function were performed using 16 calls on the
8 × 8 transpose program of Figure 7-6 (modified to run on a
32-bit machine), then it would take 16(101 + 5) = 1696
instructions, assuming the 16 calls are “strung out.” This
includes five instructions for each function call (observed in
compiled code). Therefore, the two methods are, on the surface
anyway, very nearly equal in execution time.


On the other hand, for a 64-bit machine the code of Figure 7-7
can easily be modified to transpose a 64 × 64-bit matrix, and it
would take about 4 + 6(7 + 32 · 20) = 3886 instructions.
Doing the job with 64 executions of the 8 × 8 transpose method
would take about 64(85 + 5) = 5760 instructions.


The algorithm works in place, and thus if it is used to 
transpose a larger matrix, additional steps are required to move 
32 × 32-bit submatrices. It can be made to put the result matrix
in an area distinct from the source matrix by separating out 
either the first or last execution of the “for j-loop” and having it 
store the result in the other area. 


About half the instructions executed by the function of Figure 
7-7 are for loop control, and the function loads and stores the 
entire matrix five times. Would it be reasonable to reduce this 
overhead by unrolling the loops? It would, if you are looking for 
the ultimate in speed, if memory space is not a problem, if your 
machine’s I-fetching can keep up with a large block of straight- 
line code, and especially if the branches or loads are costly in 
execution time. The bulk of the program will be the six 
instructions that do the bit swaps repeated 80 times (5 · 16). In 
addition, the program will need 32 load instructions to load the 
source matrix and 32 store instructions to store the result, for a 
total of at least 544 instructions. 


Figure 7-8 outlines a program in which the unrolling is done 
by hand. This program is shown as not working in place, but it 
executes correctly in place, if that is desired, by invoking it with 
identical arguments. The number of "swap" lines is 80. Our GNU
C compiler for the basic RISC machine compiles this into 576 
instructions (branch-free, except for the function return), 
counting prologs and epilogs. This machine does not have the 
store multiple and load multiple instructions, but it can save and 
restore registers two at a time with store double and load double 
instructions. 


#define swap(a0, a1, j, m) t = (a0 ^ (a1 >> j)) & m; \
                           a0 = a0 ^ t; \
                           a1 = a1 ^ (t << j);

void transpose32(unsigned A[32], unsigned B[32]) {
   unsigned m, t;
   unsigned a0, a1, a2, a3, a4, a5, a6, a7,
            a8, a9, a10, a11, a12, a13, a14, a15,
            a16, a17, a18, a19, a20, a21, a22, a23,
            a24, a25, a26, a27, a28, a29, a30, a31;

   a0 = A[ 0]; a1 = A[ 1]; a2 = A[ 2]; a3 = A[ 3];
   a4 = A[ 4]; a5 = A[ 5]; a6 = A[ 6]; a7 = A[ 7];
   ...
   a28 = A[28]; a29 = A[29]; a30 = A[30]; a31 = A[31];

   m = 0x0000FFFF;
   swap(a0, a16, 16, m)
   swap(a1, a17, 16, m)
   ...
   swap(a15, a31, 16, m)
   m = 0x00FF00FF;
   swap(a0, a8, 8, m)
   swap(a1, a9, 8, m)
   ...
   swap(a28, a29, 1, m)
   swap(a30, a31, 1, m)

   B[ 0] = a0; B[ 1] = a1; B[ 2] = a2; B[ 3] = a3;
   B[ 4] = a4; B[ 5] = a5; B[ 6] = a6; B[ 7] = a7;
   ...
   B[28] = a28; B[29] = a29; B[30] = a30; B[31] = a31;
}



FIGURE 7-8. Straight-line code for transposing a 32 x 32-bit 
matrix. 


There is a way to squeeze a little more performance out of 
this if your machine has a rotate shift instruction (either left or 
right). The idea is to replace all the swap operations of Figure 7- 
8, which take six instructions each, with simpler swaps that do 
not involve a shift, which take four instructions each (use the 
swap macro given, with the shifts omitted). 


First, rotate right words A[16..31] (that is, A[k] for 16 ≤ k
≤ 31) by 16 bit positions. Second, swap the right halves of
A[0] with A[16], A[1] with A[17], and so on, similarly to the
code of Figure 7-8. Third, rotate right words A[8..15] and
A[24..31] by eight bit positions, and then swap the bits
indicated by a mask of 0x00FF00FF in words A[0] and A[8],
A[1] and A[9], and so on, as in the code of Figure 7-8. After five
stages of this, you don't quite have the transpose. Finally, you
have to rotate left word A[1] by one bit position, A[2] by two
bit positions, and so on (31 instructions). We do not show the
code, but the steps are illustrated below for a 4 × 4-bit matrix.


   abcd      abcd      abij      abij      aeim      aeim
   efgh  =>  efgh  =>  efmn  =>  nefm  =>  nbfj  =>  bfjn
   ijkl      klij      klcd      klcd      kocg      cgko
   mnop      opmn      opgh      hopg      hlpd      dhlp
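

The text does not show code for the rotate-based method. The
following is a looped sketch of the steps just described, written for
clarity rather than speed; it is not the straight-line program the
text has in mind. The names transpose32_rot, rotl32, and rotr32 are
hypothetical, and 32-bit words are assumed. In stage j, every word
whose index has bit j set is rotated right by j, and then the bits
under mask m are exchanged between words k and k + j with no shift.

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, int n) {   /* 0 <= n <= 31 */
    return (x << n) | (x >> ((32 - n) & 31));
}

static uint32_t rotr32(uint32_t x, int n) {   /* 0 <= n <= 31 */
    return (x >> n) | (x << ((32 - n) & 31));
}

/* In-place transpose using rotates and shift-free swaps. */
void transpose32_rot(uint32_t A[32]) {
    uint32_t m = 0x0000FFFFu, t;

    for (int j = 16; j != 0; j >>= 1, m ^= m << j) {
        for (int i = 0; i < 32; i++)       /* rotate the "lower" words */
            if (i & j)
                A[i] = rotr32(A[i], j);
        for (int k = 0; k < 32; k = (k + j + 1) & ~j) {
            t = (A[k] ^ A[k + j]) & m;     /* swap bits under m */
            A[k] ^= t;
            A[k + j] ^= t;
        }
    }
    for (int i = 1; i < 32; i++)           /* final fix-up rotates */
        A[i] = rotl32(A[i], i);
}
```

A straight-line version would unroll both loops and keep the words in
registers, as in Figure 7-8.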


The bit-rearranging part of the program of Figure 7-8 
requires 480 instructions (80 swaps at six instructions each). The 
revised program, using rotate instructions, requires 80 swaps at 
four instructions each, plus 80 rotate instructions (16 · 5) for the 
first five stages, plus a final 31 rotate instructions, for a total of 
431 instructions. The prolog and epilog code would be 
unchanged, so using rotate instructions in this way saves 49 
instructions. 


There is another quite different method of transposing a bit
matrix: apply three shearing transformations [GLS1]. If the
matrix is n × n, the steps are (1) rotate row i to the right i bit
positions, (2) rotate column j upwards (j + 1) mod n bit
positions, (3) rotate row i to the right (i + 1) mod n bit
positions, and (4) reflect the matrix about a horizontal axis
through the midpoint. To illustrate, for a 4 × 4-bit matrix:


   abcd      abcd      hlpd      dhlp      aeim
   efgh  =>  hefg  =>  kocg  =>  cgko  =>  bfjn
   ijkl      klij      nbfj      bfjn      cgko
   mnop      nopm      aeim      aeim      dhlp


This method is not quite competitive with the others, because 
step (2) is costly. (To do it at reasonable cost, rotate upward all 
columns that rotate by n/2 or more bit positions by n / 2 bit 
positions [these are columns n / 2 - 1 through n-2], then rotate 
certain columns upward n / 4 bit positions, and so on.) Steps 1 
and 3 require only n - 1 instructions each, and step 4 requires 
no instructions at all if the results are simply stored to the 
appropriate locations. 


If an 8 x 8-bit matrix is stored in a 64-bit word in the obvious 
way (top row in the most significant eight bits, and so on), then 
the matrix transpose operation is equivalent to three outer 
perfect shuffles or unshuffles [GLS1]. This is a very good way to 
do it if your machine has shuffle or unshuffle as a single 
instruction, but it is not a good method on a basic RISC machine. 


7-4 Compress, or Generalized Extract 


The APL language includes an operation called compress, written 
B/V, where B is a Boolean vector and V is a vector of the same
length as B, with arbitrary elements. The result of the operation 
is a vector consisting of the elements of V for which the 
corresponding bit in B is 1. The length of the result vector is 
equal to the number of 1’s in B. 
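

For readers unfamiliar with APL, the vector operation can be sketched
in C as follows. This is only an illustration of the semantics just
described; the function name compress_vector and the choice of int
elements are assumptions.

```c
#include <stddef.h>

/* Copy to out[] the elements v[i] for which b[i] is nonzero, keeping
   their order. Returns the length of the result, which equals the
   number of 1's in b. out must have room for up to n elements. */
size_t compress_vector(const int *v, const unsigned char *b,
                       size_t n, int *out) {
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        if (b[i])
            out[len++] = v[i];
    return len;
}
```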


Here we consider a similar operation on the bits of a word. 
Given a mask m and a word x, the bits of x for which the 
corresponding mask bit is 1 are selected and moved 
(“compressed”) to the right. For example, if the word to be 
compressed is (where each letter denotes a single bit) 


   abcd efgh ijkl mnop qrst uvwx yzAB CDEF,


and the mask is


   0000 1111 0011 0011 1010 1010 0101 0101,


then the result is


   0000 0000 0000 0000 efgh klop qsuw zBDF.


This operation might also be called generalized extract, by
analogy with the extract instruction found on many computers.


We are interested in code for this operation with minimum
worst-case execution time, and offer the simple loop of Figure 7-9
as a straw man to be improved upon. This code has no
branches in the loop, and it executes in 260 instructions worst
case, including the subroutine prolog and epilog.


unsigned compress(unsigned x, unsigned m) {
   unsigned r, s, b;    // Result, shift, mask bit.

   r = 0;
   s = 0;
   do {
      b = m & 1;
      r = r | ((x & b) << s);
      s = s + b;
      x = x >> 1;
      m = m >> 1;
   } while (m != 0);
   return r;
}



FIGURE 7-9. A simple loop for the compress operation. 


It is possible to improve on this by repeatedly using the 
parallel suffix method (see page 97) with the exclusive or 
operation [GLS1]. We will denote the parallel suffix operation 
by PS-XOR. The basic idea is to first identify the bits of 
argument x that are to be moved right an odd number of bit 
positions, and move those. (This operation is simplified if x is 
first anded with the mask, to clear out irrelevant bits.) Mask bits 
are moved in the same way. Next, we identify the bits of x that 
are to be moved an odd multiple of 2 positions (2, 6, 10, and so 
on), and then we move these bits of x and the mask. Next, we
identify and move the bits that are to be moved an odd multiple
of 4 positions, then those that move an odd multiple of 8, and
then those that move 16 bit positions.


Because this algorithm, believed to be original with [GLS1], 
is a bit difficult to understand, and because it is perhaps 
surprising that something along these lines can be done at all, 
we will describe its operation in some detail. Suppose the inputs 
are 


   x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF,
   m = 1000 1000 1110 0000 0000 1111 0101 0101,
       1    1    111
       9    6    333            4444  3 2  1 0


where each letter in x represents a single bit (with value 0 or 1).
The numbers below each 1-bit in the mask m denote how far the
corresponding bit of x must move to the right. This is the
number of 0’s in m to the right of the bit. As mentioned above, it
is convenient to first clear out the irrelevant bits of x, giving


   x = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F.


The plan is to first determine which bits move an odd number 
of positions (to the right), and move those one bit position. 
Recall that the PS-XOR operation results in a 1-bit at each 
position where the number of 1’s at and to the right of that 
position is odd. We wish to identify those bits for which the 
number of 0’s strictly to the right is odd. This can be done by 
computing mk = ~m << 1 and performing PS-XOR on the result. 
This gives 


   mk = 1110 1110 0011 1111 1110 0001 0101 0100,
   mp = 1010 0101 1110 1010 1010 0000 1100 1100.


Observe that mk identifies the bits of m that have a 0
immediately to the right, and mp sums these, modulo 2, from the
right. Thus, mp identifies the bits of m that have an odd number
of 0’s to the right.
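

The computation of mk and mp can be tried out with the sketch below,
which packages the PS-XOR operation as a function. The names ps_xor
and odd_movers are hypothetical, and 32-bit words are assumed. In
hexadecimal, the example mask is m = 0x88E00F55, which gives
mk = 0xEE3FE154 and mp = 0xA5EAA0CC.

```c
#include <stdint.h>

/* Parallel suffix with exclusive or: bit i of the result is the XOR
   of bits 0 through i of x. */
static uint32_t ps_xor(uint32_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    return x;
}

/* The 1-bits of m that have an odd number of 0's strictly to their
   right, that is, mp & m. */
static uint32_t odd_movers(uint32_t m) {
    uint32_t mk = ~m << 1;     /* 1 where m has a 0 just to the right */
    uint32_t mp = ps_xor(mk);  /* parity of the 0's to the right */
    return mp & m;
}
```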


The bits that will be moved one position are those that are in 
positions that have an odd number of 0’s strictly to the right 
(identified by mp) and that have a 1-bit in the original mask. 
This is simply mv = mp & m: 


   mv = 1000 0000 1110 0000 0000 0000 0100 0100.


These bits of m can be moved with the assignment


   m = (m ^ mv) | (mv >> 1);


and the same bits of x can be moved with the two assignments


   t = x & mv;
   x = (x ^ t) | (t >> 1);


(Moving the bits of m is simpler because all the selected bits
are 1’s.) Here the exclusive or is turning off bits known to be 1 in
m and x, and the or is turning on bits known to be 0 in m and x.
The operations could also, alternatively, both be exclusive or, or
subtract and add, respectively. The results, after moving the bits
selected by mv right one position, are:


   m = 0100 1000 0111 0000 0000 1111 0011 0011,
   x = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF.


Now we must prepare a mask for the second iteration, in
which we identify bits that are to move an odd multiple of 2
positions to the right. Notice that the quantity mk & ~mp
identifies those bits that have a 0 immediately to the right in the
original mask m, and those bits that have an even number of 0’s
to the right in the original mask. These properties apply jointly,
although not individually, to the revised mask m. (That is to say,
mk identifies all the positions in the revised mask m that have a 0
to the immediate right and an even number of 0’s to the right.)
This is the quantity that, if summed from the right with PS-XOR,
identifies those bits that move to the right an odd multiple of 2
positions (2, 6, 10, and so on). Therefore, the procedure is to
assign this quantity to mk and perform a second iteration of the
above steps. The revised value of mk is


   mk = 0100 1010 0001 0101 0100 0001 0001 0000.


A complete C function for this operation is shown in Figure 
7-10. It does the job in 127 basic RISC instructions (constant)1, 
including the subroutine prolog and epilog. Figure 7-11 shows 
the sequence of values taken on by certain variables at key 
points in the computation, with the same inputs that were used 
in the discussion above. Observe that a by-product of the 
algorithm, in the last value assigned to m, is the original m with 
all its 1-bits compressed to the right. 


unsigned compress(unsigned x, unsigned m) {
   unsigned mk, mp, mv, t;
   int i;

   x = x & m;           // Clear irrelevant bits.
   mk = ~m << 1;        // We will count 0's to right.

   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk << 1);             // Parallel suffix.
      mp = mp ^ (mp << 2);
      mp = mp ^ (mp << 4);
      mp = mp ^ (mp << 8);
      mp = mp ^ (mp << 16);
      mv = mp & m;                     // Bits to move.
      m = m ^ mv | (mv >> (1 << i));   // Compress m.
      t = x & mv;
      x = x ^ t | (t >> (1 << i));     // Compress x.
      mk = mk & ~mp;
   }
   return x;
}



FIGURE 7-10. Parallel suffix method for the compress 
operation. 


We calculate that the algorithm of Figure 7-10 would execute
in 169 instructions on a 64-bit basic RISC, as compared to 516
(worst case) for the algorithm of Figure 7-9.
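

The 64-bit version is not shown in the text. A plausible adaptation
of Figure 7-10, offered as a sketch under the assumption that
uint64_t is available, needs six iterations instead of five and one
more parallel suffix step (a shift by 32). The name compress64 is
hypothetical, and no claim is made that this is the exact code behind
the instruction count quoted above.

```c
#include <stdint.h>

/* Parallel suffix compress for a 64-bit word: adapted from the
   32-bit method by adding one suffix step and one iteration. */
uint64_t compress64(uint64_t x, uint64_t m) {
    uint64_t mk, mp, mv, t;

    x = x & m;               /* Clear irrelevant bits. */
    mk = ~m << 1;            /* We will count 0's to right. */

    for (int i = 0; i < 6; i++) {
        mp = mk ^ (mk << 1); /* Parallel suffix. */
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mp = mp ^ (mp << 32);
        mv = mp & m;                      /* Bits to move. */
        m = (m ^ mv) | (mv >> (1 << i));  /* Compress m. */
        t = x & mv;
        x = (x ^ t) | (t >> (1 << i));    /* Compress x. */
        mk = mk & ~mp;
    }
    return x;
}
```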


The number of instructions required by the algorithm of
Figure 7-10 can be reduced substantially if the mask m is a
constant. This can occur in two situations: (1) a call to
“compress(x, m)” occurs in a loop, in which the value of m is
not known, but it is a loop constant, and (2) the value of m is
known, and the code for compress is generated in advance,
perhaps by a compiler.


Notice that the value assigned to x in the loop in Figure 7-10
is not used in the loop for anything other than the assignment to
x. And x is dependent only on itself and variable mv. Therefore,
the subroutine can be coded with all references to x deleted,
and the five values computed for mv can be saved in variables
mv0, ..., mv4. Then, in situation (1) the function without
references to x can be placed outside the loop in which
“compress(x, m)” occurs, and the following statements can be
placed in the loop:


   x = x & m;
   t = x & mv0;  x = x ^ t | (t >> 1);
   t = x & mv1;  x = x ^ t | (t >> 2);
   t = x & mv2;  x = x ^ t | (t >> 4);
   t = x & mv3;  x = x ^ t | (t >> 8);
   t = x & mv4;  x = x ^ t | (t >> 16);


This is only 21 instructions in the loop (the loading of the
constants can be placed outside the loop), a considerable
improvement over the 127 required by the full subroutine of
Figure 7-10.
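

One way to organize situation (1) is sketched below: one function
computes the five move masks from m alone, and a second applies them
to each x. The names compress_masks and compress_with_masks are
hypothetical, the masks are kept in an array rather than in five
separate variables, and the second function uses a loop where the
text uses five straight-line statements.

```c
#include <stdint.h>

/* The loop of the parallel suffix method with all references to x
   deleted; saves the five move masks in mv[0..4]. */
void compress_masks(uint32_t m, uint32_t mv[5]) {
    uint32_t mk = ~m << 1, mp;

    for (int i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);
        mp ^= mp << 2;
        mp ^= mp << 4;
        mp ^= mp << 8;
        mp ^= mp << 16;
        mv[i] = mp & m;
        m = (m ^ mv[i]) | (mv[i] >> (1 << i));
        mk &= ~mp;
    }
}

/* Compress x using masks precomputed for the same m. */
uint32_t compress_with_masks(uint32_t x, uint32_t m,
                             const uint32_t mv[5]) {
    uint32_t t;

    x &= m;
    for (int i = 0; i < 5; i++) {
        t = x & mv[i];
        x = (x ^ t) | (t >> (1 << i));
    }
    return x;
}
```

A caller would invoke compress_masks once before its loop and
compress_with_masks inside it.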


                x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF
                m = 1000 1000 1110 0000 0000 1111 0101 0101
                x = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F

   i = 0,      mk = 1110 1110 0011 1111 1110 0001 0101 0100
   After PS,   mp = 1010 0101 1110 1010 1010 0000 1100 1100
               mv = 1000 0000 1110 0000 0000 0000 0100 0100
                m = 0100 1000 0111 0000 0000 1111 0011 0011
                x = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF

   i = 1,      mk = 0100 1010 0001 0101 0100 0001 0001 0000
   After PS,   mp = 1100 0110 0000 1100 1100 0000 1111 0000
               mv = 0100 0000 0000 0000 0000 0000 0011 0000
                m = 0001 1000 0111 0000 0000 1111 0000 1111
                x = 000a e000 0ijk 0000 0000 uvwx 0000 zBDF

   i = 2,      mk = 0000 1000 0001 0001 0000 0001 0000 0000
   After PS,   mp = 0000 0111 1111 0000 1111 1111 0000 0000
               mv = 0000 0000 0111 0000 0000 1111 0000 0000
                m = 0001 1000 0000 0111 0000 0000 1111 1111
                x = 000a e000 0000 0ijk 0000 0000 uvwx zBDF

   i = 3,      mk = 0000 1000 0000 0001 0000 0000 0000 0000
   After PS,   mp = 0000 0111 1111 1111 0000 0000 0000 0000
               mv = 0000 0000 0000 0111 0000 0000 0000 0000
                m = 0001 1000 0000 0000 0000 0111 1111 1111
                x = 000a e000 0000 0000 0000 0ijk uvwx zBDF

   i = 4,      mk = 0000 1000 0000 0000 0000 0000 0000 0000
   After PS,   mp = 1111 1000 0000 0000 0000 0000 0000 0000
               mv = 0001 1000 0000 0000 0000 0000 0000 0000
                m = 0000 0000 0000 0000 0001 1111 1111 1111
                x = 0000 0000 0000 0000 000a eijk uvwx zBDF

FIGURE 7-11. Operation of the parallel suffix method for the 
compress operation. 


In situation (2), in which the value of m is known, the same
sort of thing can be done, and further optimization may be
possible. It might happen that one of the five masks is 0, in
which case one of the five lines shown above can be omitted.
For example, mask mv0 is 0 if it happens that no bit moves an odd
number of positions, and mv4 is 0 if no bit moves more than 15
positions, and so on.


As an example, for


   m = 0101 0101 0101 0101 0101 0101 0101 0101,


the calculated masks are


   mv0 = 0100 0100 0100 0100 0100 0100 0100 0100
   mv1 = 0011 0000 0011 0000 0011 0000 0011 0000
   mv2 = 0000 1111 0000 0000 0000 1111 0000 0000
   mv3 = 0000 0000 1111 1111 0000 0000 0000 0000
   mv4 = 0000 0000 0000 0000 0000 0000 0000 0000


Because the last mask is 0, in the compiled code situation this 
compression operation is done in 17 instructions (not counting 
the loading of the masks). This is not quite as good as the code 
shown for this operation on page 141 (13 instructions, not 
counting the loading of masks), which takes advantage of the 
fact that alternate bits are being selected. 


Using Insert and Extract 


If your computer has the insert instruction, preferably with 
immediate values for the operands that identify the bit field in 
the target register, then in the compiled situation insert can often 
be used to do the compress operation with fewer instructions 
than the methods discussed above. Furthermore, it doesn’t tie up 
registers holding the masks. 


The target register is initialized to 0, and then, for each 
contiguous group of 1’s in the mask m, variable x is shifted right 
to right-justify the next field, and the insert instruction is used to 
insert the bits of x in the appropriate place in the target register. 
This does the operation in 2n + 1 instructions, where n is the 
number of fields (groups of consecutive 1’s) in the mask. The 
worst case is 33 instructions, because the maximum number of 
fields is 16 (which occurs for alternating 1’s and 0’s). 
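

The instruction count of the insert method depends only on the number
of fields in the mask, which is easy to compute. The sketch below
counts fields and applies the 2n + 1 formula; the names field_count
and insert_method_cost are hypothetical.

```c
#include <stdint.h>

/* Number of fields (maximal runs of consecutive 1's) in m. Each
   field has exactly one 1-bit whose left neighbor is 0 or absent. */
static int field_count(uint32_t m) {
    uint32_t tops = m & ~(m >> 1);   /* leftmost bit of each field */
    int n = 0;
    while (tops != 0) {
        tops &= tops - 1;            /* clear the lowest 1-bit */
        n++;
    }
    return n;
}

/* Instructions used by the insert method for a known mask m. */
static int insert_method_cost(uint32_t m) {
    return 2 * field_count(m) + 1;
}
```

For the alternating mask 0x55555555 this gives 16 fields and the
worst case of 33 instructions.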


An example in which the insert method uses substantially 
fewer instructions is m = 0x0010084A. Compressing with this 
mask requires moving bits 1, 2, 4, 8, and 16 positions. Thus, it 
takes the full 21 instructions for the parallel suffix method, but 
only 11 instructions for the insert method (there are five fields). 
A more extreme case is m = 0x80000000. Here a single bit
moves 31 positions, requiring 21 instructions for the parallel 
suffix method, but only three instructions for the insert method 
and only one instruction (shift right 31) if you are not 
constrained to any particular scheme. 


You can also use the extract instruction in various simple 
ways to do the compress operation with a known mask in 3n - 2 
instructions, where n is the number of fields in the mask. 


Clearly, the problem of compiling optimal code for the 
compress operation with a known mask is a difficult one. 


Compress Left 


To compress bits to the left, obviously you can reverse the
argument x and the mask, compress right, and reverse the
result. Another way is to compress right and then shift left by
pop(~m), the number of 0-bits in the mask. These might be
satisfactory if your computer has an instruction for bit reversal
or population count, but if not, the algorithm of Figure 7-10 is
easily adapted: Just reverse the direction of all the shifts
except the two in the expressions 1 << i (eight to change).
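

Carrying out that adaptation gives the following sketch, in which
the eight reversed shifts are the one that forms mk, the five in the
parallel scan, and the two that move bits. The name compress_left is
hypothetical and 32-bit words are assumed.

```c
#include <stdint.h>

/* Compress the bits of x selected by m to the left (high-order) end.
   Mirror image of the parallel suffix compress-right method. */
uint32_t compress_left(uint32_t x, uint32_t m) {
    uint32_t mk, mp, mv, t;

    x = x & m;               /* Clear irrelevant bits. */
    mk = ~m >> 1;            /* We will count 0's to left. */

    for (int i = 0; i < 5; i++) {
        mp = mk ^ (mk >> 1); /* Parallel prefix. */
        mp = mp ^ (mp >> 2);
        mp = mp ^ (mp >> 4);
        mp = mp ^ (mp >> 8);
        mp = mp ^ (mp >> 16);
        mv = mp & m;                      /* Bits to move. */
        m = (m ^ mv) | (mv << (1 << i));  /* Compress m. */
        t = x & mv;
        x = (x ^ t) | (t << (1 << i));    /* Compress x. */
        mk = mk & ~mp;
    }
    return x;
}
```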


The BESM-6 computer (ca. 1967) had an instruction for the 
compress left function (“Pack Bits in A Masked by X") and its 
inverse (“Unpack ...”), which operated on the machine's 48-bit 
registers. These instructions are not easy to implement. It is 
surmised by cryptography experts that their only use was for 
breaking US codes [Knu8]. The BESM-6 also had the population 
count instruction which, as has been noted, seems to be 
important to the National Security Agency. 


7-5 Expand, or Generalized Insert 


The inverse of the compress right function moves bits from the 
low-order end of a register to positions given by a mask, while 
keeping the bits in order. For example, expand(0000abcd, 
10011010) = a00bc0d0. Thus 


compress(expand(x, m), m) = x. 


This function has also been called unpack, scatter, and deposit. 
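

As a point of reference, the operation can be written as a simple
loop that deals out the low-order bits of x to the 1-positions of m,
from right to left. This sketch plays the same straw-man role that
Figure 7-9 plays for compress; the name expand_simple is
hypothetical.

```c
#include <stdint.h>

/* Move the low-order bits of x, in order, to the positions where m
   has 1-bits. All other result bits are 0. */
uint32_t expand_simple(uint32_t x, uint32_t m) {
    uint32_t r = 0;

    for (int i = 0; i < 32; i++) {
        if ((m >> i) & 1u) {
            r |= (x & 1u) << i;   /* next bit of x goes to position i */
            x >>= 1;
        }
    }
    return r;
}
```

With abcd = 1011 and the 8-bit mask 10011010 from the example above,
the result is 10001010.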


It can be obtained by running the code of Figure 7-10 in 
reverse [Allen]. To avoid overwriting bits in x, it is necessary to 
move (to the left) the bits that move a large distance first, and to 
move those that move only one position last. This means that 
the first five “move” quantities (mv in the code) must be 
computed, saved, and used in the reverse of the order in which 
they were computed. For many applications this is not a 
problem, because these applications apply the same mask m to 
large amounts of data, and so they would compute the move 
quantities in advance and reuse them anyway. 


The code is shown in Figure 7-12. It executes approximately
168 basic RISC instructions (constant), including five stores and
five loads. A 64-bit version for a 64-bit machine would execute
approximately 200 instructions.


For a machine that does not have the and not instruction, the
MUX operation in the second loop can be coded in one fewer
instruction with


   x = ((x ^ t) & mv) ^ x;
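

The equivalence of the two MUX forms can be checked directly. In the
sketch below, both functions select bits of t where mv is 1 and bits
of x where mv is 0; the function names are hypothetical.

```c
#include <stdint.h>

/* MUX written with and not: three operations when and not is a
   single instruction, four otherwise. */
static uint32_t mux_andnot(uint32_t x, uint32_t t, uint32_t mv) {
    return (x & ~mv) | (t & mv);
}

/* MUX written with exclusive or: three operations, no complement. */
static uint32_t mux_xor(uint32_t x, uint32_t t, uint32_t mv) {
    return ((x ^ t) & mv) ^ x;
}
```

Where mv is 0 the second form yields 0 ^ x = x, and where mv is 1 it
yields (x ^ t) ^ x = t.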


unsigned expand(unsigned x, unsigned m) {
   unsigned m0, mk, mp, mv, t;
   unsigned array[5];
   int i;

   m0 = m;              // Save original mask.
   mk = ~m << 1;        // We will count 0's to right.

   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk << 1);             // Parallel suffix.
      mp = mp ^ (mp << 2);
      mp = mp ^ (mp << 4);
      mp = mp ^ (mp << 8);
      mp = mp ^ (mp << 16);
      mv = mp & m;                     // Bits to move.
      array[i] = mv;
      m = (m ^ mv) | (mv >> (1 << i)); // Compress m.
      mk = mk & ~mp;
   }

   for (i = 4; i >= 0; i--) {
      mv = array[i];
      t = x << (1 << i);
      x = (x & ~mv) | (t & mv);
   }
   return x & m0;       // Clear out extraneous bits.
}



FIGURE 7-12. Parallel suffix method for the expand 
operation. 


7-6 Hardware Algorithms for Compress and Expand 


This section gives hardware-oriented algorithms for the compress
right function and its inverse [Zadeck]. Like the algorithms of the
preceding sections, their execution times are proportional to the
log of the computer's word size. They are suitable for
implementation in hardware, but do not yield fast code if
implemented in basic RISC instructions. We simply describe how
they work without giving C or machine code.


Compress 


To illustrate the operation of the algorithm, we represent each 
bit of x with a letter and consider a specific example mask m, 
shown below. 


   Input x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF
   Mask  m = 0111 1110 0110 1100 1010 1111 0011 0010


The algorithm works in log2(W) “phases,” where W is the
computer's word size in bits. Each phase operates in parallel on
“pockets” of size 2^n bits, for n ranging from 1 to log2(W). At the
end of each phase, each pocket of x contains the original pocket
of x with the bits selected by that pocket of m compressed to the
right. Each pocket of m will contain an integer that is the
number of 0-bits in that pocket of the original m. This is equal to
the number of bits of x that are not compressed to the right.
They are the known leading 0-bits in the pocket of x.

In each phase, the algorithm performs the following steps, in
parallel, on each pocket of x and m, where w is the pocket size
in bits.

1. Set L = the left half of the pocket of x, extended with w / 2
   0-bits on the right.

2. Shift L (all w bits) right by the amount given in the right
   half of the corresponding pocket of m, inserting 0’s on the
   left. No 1’s will be shifted out on the right, because the
   maximum shift amount is w / 2.

3. Set R = w / 2 0-bits followed by the right half of the
   pocket of x.

4. Replace the entire w-bit pocket of x with the or of R and
   the shifted L.

5. Add the left and right halves of the pocket of m, and
   replace the entire pocket with the sum.


To apply these steps to the first phase (w = 2) would require
first and'ing x with m, to clear out irrelevant bits of x, and
complementing m so that each bit of m is the number of 0-bits in
each 1-bit half pocket. It is simpler to make an exception of the
first phase, and combine these steps with the first compression
operation by applying the logic shown in the table below to each
2-bit pocket of x and m.


      Input       Output
     x    m      x    m
     ab   00     00   10
     ab   01     0b   01
     ab   10     0a   01
     ab   11     ab   00


The third line, for example, has m = 10 (binary). This means 
that the left bit of x is selected to be part of the result, but the 
right bit is not. Thus, the left bit (a) is compressed to the right. 
The other bit of x is cleared, which ensures that in the final 
result, all the high-order (not selected) bits will be 0. 


Applying this logic to the original x and m gives: 


   Bit pairs, x = 0bcd ef0g 0j0k mn00 0q0s uvwx 00AB 000E
              m = 0100 0001 0101 0010 0101 0000 1000 1001


In the second phase, consider for example the second nibble
above (ef0g). The quantities L = ef00 and R = 000g are
formed. L is shifted right by one position (given by the right half
of the nibble of m), giving 0ef0. This is or'ed with R, giving
0efg as the new value of the nibble. The left and right halves of
m are added, giving 0001 (no change).


   Nibbles, x = 0bcd 0efg 00jk 00mn 00qs uvwx 00AB 000E
            m = 0001 0001 0010 0010 0010 0000 0010 0011


Similarly, for the third, fourth, and fifth phases, each byte,
halfword, and word of x are compressed, and m is updated, as
follows:


   Bytes, x = 00bc defg 0000 jkmn 00qs uvwx 0000 0ABE
          m = 0000 0010 0000 0100 0000 0010 0000 0101


   Halfwords, x = 0000 00bc defg jkmn 0000 000q suvw xABE
              m = 0000 0000 0000 0110 0000 0000 0000 0111


   Words, x = 0000 0000 0000 0bcd efgj kmnq suvw xABE
          m = 0000 0000 0000 0000 0000 0000 0000 1101


Upon completion, m is an integer that gives the number of 
known leading 0’s in x. Subtracting this from the word size gives 
the number of compressed bits in x, which equals the number of 
1-bits in the original mask m. 
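

Although the algorithm is intended for hardware, it can be emulated
in software to check one's understanding of the phases. The sketch
below processes the pockets one at a time, which is exactly what
makes it slow on a conventional machine. It treats the first phase
like the others by first and'ing x with m and complementing m, rather
than using the table. The names compress_phase and hw_compress are
hypothetical.

```c
#include <stdint.h>

/* One phase on pockets of w bits (w = 2, 4, 8, 16, or 32). */
static void compress_phase(uint32_t *x, uint32_t *m, int w) {
    int h = w / 2;
    uint32_t half = (1u << h) - 1;       /* mask for a half-pocket */
    uint32_t nx = 0, nm = 0;

    for (int p = 0; p < 32; p += w) {
        uint32_t px = *x >> p;           /* pocket moved to low end */
        uint32_t pm = *m >> p;
        uint32_t L = (px >> h) & half;   /* left half of x pocket */
        uint32_t R = px & half;          /* right half of x pocket */
        uint32_t sh = pm & half;         /* 0's in right half of m */
        px = ((L << h) >> sh) | R;                 /* steps 1-4 */
        pm = ((pm >> h) & half) + (pm & half);     /* step 5 */
        nx |= px << p;
        nm |= pm << p;
    }
    *x = nx;
    *m = nm;
}

/* Compress right; *zeros (if not null) receives the final m, the
   number of known leading 0's in the result. */
uint32_t hw_compress(uint32_t x, uint32_t m, uint32_t *zeros) {
    x &= m;    /* clear irrelevant bits */
    m = ~m;    /* each 1-bit half-pocket now counts its own 0-bits */
    for (int w = 2; w <= 32; w *= 2)
        compress_phase(&x, &m, w);
    if (zeros)
        *zeros = m;
    return x;
}
```

With every letter of x set to 1, the example mask (0x7E6CAF32)
produces 19 one-bits at the right and a final m of 13.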


The reason this is not a very good algorithm for 
implementation with basic RISC instructions is that it is hard to 
shift the half-pockets right by differing amounts. On the other 
hand, it might possibly be useful on an SIMD machine that has 
instructions that operate on the pockets of a word in parallel and 
independently. 


Expand 


The hardware compression algorithm can be turned into an 
expansion algorithm by, essentially, running it first forward and 
then in reverse. As in the algorithms based on the parallel suffix 
method, the five masks of the hardware compression algorithm 
are computed, saved, and used in the reverse of the order in 
which they were computed. Actually, the last mask is not used 
(nor is it used in the compression algorithm), but an additional 
one is required (m0) that is simply the complement of the
original mask. In the forward pass, only the steps for computing 
the masks need be done; those involving the data x can be 
omitted. 


To illustrate, suppose we have 


   Input x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF
   Mask  m = 0111 1110 0110 1100 1010 1111 0011 0010


Then the result of the expansion should be


   0nop qrs0 0tu0 vw00 x0y0 zABC 00DE 00F0.


The masks are shown below.


   m0 = 1000 0001 1001 0011 0101 0000 1100 1101
   m1 = 0100 0001 0101 0010 0101 0000 1000 1001
   m2 = 0001 0001 0010 0010 0010 0000 0010 0011
   m3 = 0000 0010 0000 0100 0000 0010 0000 0101
   m4 = 0000 0000 0000 0110 0000 0000 0000 0111


The integer values of each half of m4 give the number of
0-bits in the corresponding half of the original mask m. In
particular, the right half of m has seven 0-bits. This means that
the seven high-order bits of the right half of x do not belong
there; they should be in the left half of x. Thus, bits 9 through
15 of x should be shifted left just enough to put them in the left
half of x, and higher-order bits of x should be shifted left to
accommodate them. This can be accomplished by shifting left
the entire 32-bit word x by seven positions and replacing the left
half of x with the left half of the shifted quantity. This gives


   x = hijk lmno pqrs tuvw qrst uvwx yzAB CDEF.


In general, the algorithm works with pocket sizes from 32
down to 2, in five phases, using masks m4 down to m0. Each
pocket (in parallel) is shifted left, discarding bits that are shifted
out on the left, and supplying 0's to vacated positions on the
right, so that the shifted quantity is the same length as the
pocket from which it came. Then the left half of the pocket is
replaced by the left half of the shifted quantity. This will leave
"garbage" bits in both halves of the pocket. They will be zeroed
out after the last phase by and'ing with the original mask.


Continuing, we treat m3 as two 16-bit pockets. The left
pocket has the integer 4 in its right half, so the left pocket of x is
shifted left four positions (giving lmno pqrs tuvw 0000), and the
left half of this replaces the left half of the left pocket in x,
making the left pocket of x = lmno pqrs pqrs tuvw. Performing the
same operation on the right 16-bit pocket of x gives


x = lmno pqrs pqrs tuvw vwxy zABC yzAB CDEF.


The next phase uses m2, which consists of four 8-bit pockets. 
Applying it to x gives 


x = mnop pqrs rstu tuvw vwxy zABC BCDE CDEF.


The next phase uses m1, which consists of eight 4-bit pockets. 
Applying it to x gives 


x = mnop qrrs sttu vwvw wxxy zABC BCDE DEEF.


The last phase uses m0, which consists of sixteen 2-bit 
pockets. Applying it to x gives 


x = mnop qrss stuu vwww xxyy zABC CCDE EEFF.


The final step is to and this with the original mask to clear 
irrelevant bits. This gives 


x = 0nop qrs0 0tu0 vw00 x0y0 zABC 00DE 00F0.


The half-pockets of each computed mask contain a count of
the number of 0-bits in the corresponding half-pocket of the
original mask m. Therefore, as an alternative to computing the
masks and saving them, the machine could employ circuits for
doing a population count of the 0’s in the half-pockets “on the
fly.”
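The five phases described above can be simulated in software to check the
worked example. The following is only a sketch: the function names
(expand_ref, expand_pockets) are hypothetical, the masks are passed in as an
array indexed so that mk[4] is m4 and mk[0] is m0, and the original mask is
assumed to be the complement of m0 shown above (0x7E6CAF32). The reference
routine is a plain bit loop used only for comparison.

```c
#include <stdint.h>

/* Reference expand: scatter the low-order bits of x to the positions
   where m has 1-bits (simple loop, used only as a check). */
uint32_t expand_ref(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        if (m & (1u << i)) {
            if (x & 1u) r |= 1u << i;
            x >>= 1;
        }
    }
    return r;
}

/* Pocket-by-pocket simulation of the hardware scheme. mk[k] holds, in
   each half-pocket, the count of 0-bits in the matching half-pocket of m. */
uint32_t expand_pockets(uint32_t x, uint32_t m, const uint32_t mk[5]) {
    for (int k = 4; k >= 0; k--) {
        int size = 2 << k;                     /* 32, 16, 8, 4, 2 */
        int half = size / 2;
        uint32_t pmask = (size == 32) ? 0xFFFFFFFFu : ((1u << size) - 1u);
        uint32_t rmask = (1u << half) - 1u;    /* right half of a pocket */
        uint32_t out = 0;
        for (int pos = 0; pos < 32; pos += size) {
            uint32_t pocket  = (x >> pos) & pmask;
            uint32_t count   = (mk[k] >> pos) & rmask;
            uint32_t shifted = (pocket << count) & pmask;
            /* Left half comes from the shifted copy, right half is kept. */
            uint32_t merged  = (shifted & ~rmask & pmask) | (pocket & rmask);
            out |= merged << pos;
        }
        x = out;
    }
    return x & m;                              /* clear the garbage bits */
}
```

With the five masks listed above written in hexadecimal (m0 = 0x819350CD,
m1 = 0x41525089, m2 = 0x11222023, m3 = 0x02040205, m4 = 0x00060007), the two
routines should agree for any x.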


7-7 General Permutations, Sheep and Goats Operation


To do general permutations of the bits in a word, or of anything
else, a central problem is how to represent the permutation. It
cannot be represented very compactly. Because there are 32!
permutations of the bits in a 32-bit word, at least
$\lceil \log_2(32!) \rceil = 118$ bits, or three words plus 22 bits,
are required to designate one permutation out of the 32!.


One interesting way to represent permutations is closely 
related to the compression operations discussed in Section 7-4 
[GLS1]. Start with the direct method of simply listing the bit 
position to which each bit moves. For example, for the 
permutation done by a rotate left of four bit positions, the bit at 
position 0 (the least significant bit) moves to position 4, 1 moves 
to 5, ..., 31 moves to 3. This permutation can be represented by 
the vector of 32 5-bit indexes: 


00100
00101
...
11111
00000
00001
00010
00011


Treating that as a bit matrix, the representation we have in 
mind is its transpose, except reflected about the off diagonal so 
the top row contains the least significant bits and the result uses 
little-endian bit numbering. This we store as five 32-bit words in 
array p: 




p[0] = 1010 1010 1010 1010 1010 1010 1010 1010 
p[1] = 1100 1100 1100 1100 1100 1100 1100 1100 
p[2] = 0000 1111 0000 1111 0000 1111 0000 1111 
p[3] = 0000 1111 1111 0000 0000 1111 1111 0000 
p[4] = 0000 1111 1111 1111 1111 0000 0000 0000 


Each bit of p[0] is the least significant bit of the position to
which the corresponding bit of x moves, each bit of p[1] is the
next more significant bit, and so on. This is similar to the
encoding of the masks denoted by mv in the previous section,
except that mv applies to revised masks in the compress
algorithm, not to the original mask.


The compression operation we need compresses to the left all 
bits marked with 1's in the mask, and compresses to the right all 


bits marked with 0’s.2 This is sometimes called the “sheep and 
goats" operation (SAG), or "generalized unshuffle." It can be 
calculated with 


SAG(x, m) = compress_left(x, m) | compress(x, ~m).
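As a concrete reading of this formula, here is a minimal sketch in C that
uses simple bit loops in place of the faster compress routines of Section
7-4. The function names are illustrative, and the loop-based bodies are an
assumption made to keep the example short, not the book's implementation.

```c
#include <stdint.h>

/* Gather the bits of x selected by m at the right (low-order) end,
   preserving their order. */
uint32_t compress(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    int k = 0;                              /* next free slot from the right */
    for (int i = 0; i < 32; i++) {
        if ((m >> i) & 1u) {
            r |= ((x >> i) & 1u) << k;
            k++;
        }
    }
    return r;
}

/* Gather the bits of x selected by m at the left (high-order) end. */
uint32_t compress_left(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    int k = 31;                             /* next free slot from the left */
    for (int i = 31; i >= 0; i--) {
        if ((m >> i) & 1u) {
            r |= ((x >> i) & 1u) << k;
            k--;
        }
    }
    return r;
}

/* Sheep (mask bit 1) go left, goats (mask bit 0) go right. */
uint32_t sag(uint32_t x, uint32_t m) {
    return compress_left(x, m) | compress(x, ~m);
}
```

Because the two compressed fields never overlap (their lengths sum to 32),
the or can equally be an add or an exclusive or.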


With SAG as a fundamental operation, and a permutation p
as described above, the bits of a word x can be permuted by p
in the following 15 steps:


x    = SAG(x,    p[0]);
p[1] = SAG(p[1], p[0]);
p[2] = SAG(p[2], p[0]);
p[3] = SAG(p[3], p[0]);
p[4] = SAG(p[4], p[0]);
x    = SAG(x,    p[1]);
p[2] = SAG(p[2], p[1]);
p[3] = SAG(p[3], p[1]);
p[4] = SAG(p[4], p[1]);
x    = SAG(x,    p[2]);
p[3] = SAG(p[3], p[2]);
p[4] = SAG(p[4], p[2]);
x    = SAG(x,    p[3]);
p[4] = SAG(p[4], p[3]);
x    = SAG(x,    p[4]);


In these steps, SAG is used to perform a stable binary radix
sort. Array p is used as 32 5-bit keys to sort the bits of x. In the
first step, all bits of x for which p[0] = 1 are moved to the left
half of the resulting word, and all those for which p[0] = 0 are
moved to the right half. Other than this, the order of the bits is
not changed (that is, the sort is “stable”). Then all the keys that
will be used for the next round of sorting are similarly sorted.
The sixth line is sorting x based on the second least significant
bit of the key, and so on.
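The 15 steps can be written as two nested loops. The sketch below is
illustrative only: sag is a loop-based stand-in for a real SAG instruction,
and build_p and permute15 are hypothetical helper names. build_p derives the
five words of p from a "goes to" list, so it can be checked against the
rotate-left-by-four example given earlier.

```c
#include <stdint.h>

/* Loop-based SAG: bits with mask 1 go left, bits with mask 0 go right,
   each group keeping its original order. */
uint32_t sag(uint32_t x, uint32_t m) {
    uint32_t sheep = 0, goats = 0;
    int ns = 0, ng = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t bit = (x >> i) & 1u;
        if ((m >> i) & 1u) { sheep |= bit << ns; ns++; }
        else               { goats |= bit << ng; ng++; }
    }
    /* The sheep sit just above the ng goat bits. */
    return (ng == 32 ? 0u : sheep << ng) | goats;
}

/* Bit i of p[k] is bit k of the position to which bit i of x moves. */
void build_p(const uint8_t goes_to[32], uint32_t p[5]) {
    for (int k = 0; k < 5; k++) {
        p[k] = 0;
        for (int i = 0; i < 32; i++)
            p[k] |= (uint32_t)((goes_to[i] >> k) & 1u) << i;
    }
}

/* The 15-step radix sort: 5 steps on x plus 4+3+2+1 steps on the keys. */
uint32_t permute15(uint32_t x, const uint32_t p_in[5]) {
    uint32_t p[5];
    for (int k = 0; k < 5; k++) p[k] = p_in[k];
    for (int k = 0; k < 5; k++) {
        x = sag(x, p[k]);
        for (int j = k + 1; j < 5; j++)
            p[j] = sag(p[j], p[k]);
    }
    return x;
}
```

For the list goes_to[i] = (i + 4) mod 32, build_p should reproduce the five
words p[0] through p[4] shown above, and permute15 should then behave as a
rotate left of four positions.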


Similar to the situation of compressing, if a certain
permutation p is to be used on a number of words x, then a
considerable savings results by precomputing most of the steps
above. The permutation array is revised to


p[1] = SAG(p[1], p[0]);
p[2] = SAG(SAG(p[2], p[0]), p[1]);
p[3] = SAG(SAG(SAG(p[3], p[0]), p[1]), p[2]);
p[4] = SAG(SAG(SAG(SAG(p[4], p[0]), p[1]), p[2]), p[3]);


and then each permutation is done with


x = SAG(x, p[0]);
x = SAG(x, p[1]);
x = SAG(x, p[2]);
x = SAG(x, p[3]);
x = SAG(x, p[4]);
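A sketch of this split into a one-time precomputation and a five-step
application follows. As before, sag is a loop-based stand-in for a SAG
instruction, and the names precompute_p and apply_p are hypothetical.

```c
#include <stdint.h>

/* Loop-based SAG (stand-in for a hardware instruction). */
uint32_t sag(uint32_t x, uint32_t m) {
    uint32_t sheep = 0, goats = 0;
    int ns = 0, ng = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t bit = (x >> i) & 1u;
        if ((m >> i) & 1u) { sheep |= bit << ns; ns++; }
        else               { goats |= bit << ng; ng++; }
    }
    return (ng == 32 ? 0u : sheep << ng) | goats;
}

/* Revise p in place, once per permutation. After the pass for k,
   p[k + 1] is final and is used to sort the remaining keys. */
void precompute_p(uint32_t p[5]) {
    for (int k = 0; k < 4; k++)
        for (int j = k + 1; j < 5; j++)
            p[j] = sag(p[j], p[k]);
}

/* Apply a precomputed permutation to one word: five SAG steps. */
uint32_t apply_p(uint32_t x, const uint32_t p[5]) {
    for (int k = 0; k < 5; k++)
        x = sag(x, p[k]);
    return x;
}
```

The ten key-sorting steps are paid once in precompute_p, so each additional
word costs only the five steps in apply_p.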


A more direct (but perhaps less interesting) way to do general 
permutations of the bits in a word is to represent a permutation 
as a sequence of 32 5-bit indexes. The kth index is the bit 
number in the source from which the kth bit of the result comes. 
(This is a “comes from" list, whereas the SAG method uses a 
"goes to" list.) These could be packed six to a 32-bit word, thus 
requiring six words to hold all 32 bit indexes. An instruction can 
be implemented in hardware such as 


bitgather Rt,Rx,Ri,


where register Rt is a target register (and also a source), register
Rx contains the bits to be permuted, and register Ri contains six
5-bit indexes (and two unused bits). The operation of the
instruction is


$$t \leftarrow (t \ll 6) \mid x_{i_0} x_{i_1} x_{i_2} x_{i_3} x_{i_4} x_{i_5}.$$


In words, the contents of the target register are shifted left six 
bit positions, and six bits are selected from word x and placed in 
the vacated six positions of t. The bits selected are given by the 
six 5-bit indexes in word i, taken in left-to-right order. The bit 
numbering in the indexes could be either little- or big-endian, 


and the operation would probably be as described for either type 
of machine. 


To permute a word, use a sequence of six such instructions,
all with the same Rt and Rx, but different index registers. In the
first index register of the sequence, only indexes i4 and i5 are
significant, as the bits selected by the other four indexes are
shifted out of the left end of Rt.
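To make the six-instruction sequence concrete, here is a software model of
the proposed instruction. It is a sketch under stated assumptions: the six
indexes are assumed to occupy the low-order 30 bits of the index word with
i0 leftmost, indexes use little-endian bit numbering, and the names
bitgather and permute_bitgather are illustrative. The text leaves these
encoding details open.

```c
#include <stdint.h>

/* Model of: t <- (t << 6) | x[i0] x[i1] x[i2] x[i3] x[i4] x[i5]. */
uint32_t bitgather(uint32_t t, uint32_t x, uint32_t idx) {
    t <<= 6;
    for (int j = 0; j < 6; j++) {
        unsigned i = (idx >> (5 * (5 - j))) & 31u;   /* field j, i0 leftmost */
        t |= ((x >> i) & 1u) << (5 - j);
    }
    return t;
}

/* from[k] is the source bit number for result bit k ("comes from" list).
   Six gathers supply 36 bits; the first four are dummies that are shifted
   out of the left end of t. */
uint32_t permute_bitgather(uint32_t x, const uint8_t from[32]) {
    uint32_t t = 0;
    int slot = -4;                       /* four leading don't-care slots */
    for (int w = 0; w < 6; w++) {
        uint32_t idx = 0;
        for (int j = 0; j < 6; j++, slot++) {
            unsigned i = (slot < 0) ? 0u : from[31 - slot];
            idx = (idx << 5) | i;
        }
        t = bitgather(t, x, idx);
    }
    return t;
}
```

In a real use the six index words would be built once per permutation rather
than on every call, as they are here for brevity.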


An implementation of this instruction would most likely 
allow index values to be repeated, so the instruction can be used 
to do more than permute bits. It can be used to repeat any 
selected bit any number of times in the target register. The SAG 
operation lacks this generality. 


It is not unduly difficult to implement this as a fast (e.g., one
cycle) instruction. The bit selection circuit consists of six 32:1
MUX's. If these are built from five stages of 2:1 MUX's in today's
technology (6 · 31 = 186 MUX's in all), the instruction would be
faster than a 32-bit add instruction [MD].


Some of the Intel machines have instructions that work much
like the bit permutation operation described, but that permute
bytes, “words” (16 bits), and “doublewords” (32 bits). These are
PSHUFB, PSHUFW, and PSHUFD (Shuffle Packed Bytes/Words/
Doublewords).


Permuting bits has applications in cryptography, and the 
closely related operation of permuting subwords (e.g., permuting 
the bytes in a word) has applications in computer graphics. Both 
of these applications are more likely to deal with 64-bit words, 
or possibly with 128, than with 32. The SAG and bitgather 
methods apply with obvious changes to these larger word sizes. 


To encrypt or decrypt a message with the Data Encryption
Standard (DES) algorithm requires a large number of
permutation-like mappings. First, key generation is done, once
per session. This involves 17 permutation-like mappings. The
first, called “permuted choice 1,” maps from a 64-bit quantity to
a 56-bit quantity (it selects the 56 non-parity bits from the key
and permutes them). This is followed by 16 permutation-like
mappings from 56 bits to 48 bits, all using the same mapping,
called “permuted choice 2.”


Following key generation, each block of 64 bits in the 
message is subjected to 34 permutation-like operations. The first 


and last operations are 64-bit permutations, one being the 
inverse of the other. There are 16 permutations with repetitions 
that map 32-bit quantities to 48 bits, all using the same 
mapping. Finally, there are 16 32-bit permutations, all using the 
same permutation. The total number of distinct mappings is six. 
They are all constants and are given in [DES]. 


DES is obsolete, as it was proved to be insecure in 1998 by 
the Electronic Frontier Foundation, using special hardware. The 
National Institute of Standards and Technology (NIST) has 
endorsed a temporary replacement called Triple DES, which 
consists of DES run serially three times on each 64-bit block, 
each time with a different key (that is, the key length is 192 bits, 
including 24 parity bits). Hence, it takes three times as many 
permutation operations as does DES to encrypt or decrypt. 


The “permanent” replacement for DES and Triple DES, the
Advanced Encryption Standard (previously known as the 
Rijndael algorithm [AES]), involves no bit-level permutations. 
The closest it comes to a permutation is a simple rotation of 32- 
bit words by a multiple of 8-bit positions. Other encryption 
methods proposed or in use generally involve far fewer bit-level 
permutations than DES. 


To compare the two permutation methods discussed here, the 
bitgather method has the advantages of (1) simpler preparation 
of the index words from the raw data describing the 
permutation, (2) simpler hardware, and (3) more general 
mappings. The SAG method has the advantages of (1) doing the 
permutation in five rather than six instructions, (2) having only 
two source registers in its instruction format (which might fit 
better in some RISC architectures), (3) scaling better to permute 
a doubleword quantity, and (4) permuting subwords more 
efficiently. 

Item (3) is discussed in [LSY]. The SAG instruction allows for 
doing a general permutation of a two-word quantity with two 
executions of the SAG instruction, a few basic RISC instructions, 
and two full permutations of single words. The bitgather 
instruction allows for doing it by executing three full 
permutations of single words, plus a few basic RISC instructions. 
This does not count preprocessing of the permutation to produce 
new quantities that depend only on the permutation. We leave it 
to the reader to discover these methods. 


Regarding item (4), to permute, for example, the four bytes
of a word with bitgather requires executing six instructions, the
same as for a general bit permutation by bitgather. But with SAG
it can be done in only two instructions, rather than the five
required for a general bit permutation by SAG. The gain in
efficiency applies even when the subwords are not a power of 2
in size; the number of steps required is $\lceil \log_2 n \rceil$,
where n is the number of subwords, not counting a possible
non-participating group of bits that stays at one end or the other.


[LSY] discusses the SAG and bitgather instructions (called 
“GRP” and “PPERM,” respectively), other possible permutation 
instructions based on networks, and permuting by table lookup. 


There is a neat hack to add 1 to the goats; that is, to
compute


$$\mathrm{SAG}^{-1}(\mathrm{SAG}(x, m) + 1,\; m)$$


without using the SAG function or its inverse [Knu8]. Here we
assume SAG(x, m) puts the goats on the right, and the addition
does not overflow into the “sheep” field. We leave to the reader
the pleasure of discovering this trick.


7-8 Rearrangements and Index Transformations 


Many simple rearrangements of the bits in a computer word 
correspond to even simpler transformations of the coordinates, 
or indexes, of the bits [GLS1]. These correspondences apply to 
rearrangements of the elements of any one-dimensional array 
provided the number of array elements is an integral power of 2. 
For programming purposes, they are useful primarily when the 
array elements are a computer word or larger in size. 


As an example, the outer perfect shuffle of the elements of an
array A of size eight, with the result in array B, consists of the
following moves:


$$A_0 \to B_0;\quad A_1 \to B_2;\quad A_2 \to B_4;\quad A_3 \to B_6;$$
$$A_4 \to B_1;\quad A_5 \to B_3;\quad A_6 \to B_5;\quad A_7 \to B_7.$$


Each B-index is the corresponding A-index rotated left one
position, using a 3-bit rotator. The outer perfect unshuffle is, of
course, accomplished by rotating right each index. Some similar
correspondences are shown in Table 7-1. Here n is the number
of array elements, “lsb” means least significant bit, and the
rotations of indexes are done with a $\log_2 n$-bit rotator.


TABLE 7-1. REARRANGEMENTS AND INDEX TRANSFORMATIONS

Rearrangement | Index transformation (array index, or big-endian bit numbering) | Index transformation (little-endian bit numbering)
--- | --- | ---
Reversal | Complement | Complement
Bit flip, or generalized reversal (page 135) | Exclusive or with a constant | Exclusive or with a constant
Rotate left k positions | Subtract k (mod n) | Add k (mod n)
Rotate right k positions | Add k (mod n) | Subtract k (mod n)
Outer perfect shuffle | Rotate left one position | Rotate right one position
Outer perfect unshuffle | Rotate right one position | Rotate left one position
Inner perfect shuffle | Rotate left one, then complement lsb | Complement lsb, then rotate right one
Inner perfect unshuffle | Complement lsb, then rotate right | Rotate left one, then complement lsb
Transpose of an 8×8-bit matrix held in a 64-bit word | Rotate (left or right) three positions | Rotate (left or right) three positions
FFT unscramble | Reverse bits | Reverse bits
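As a small illustration of the array-index column, the outer perfect shuffle
of an array can be coded by rotating each index left one position. This is a
sketch with hypothetical names (rotl_index, outer_shuffle); it assumes the
array length is a power of 2 and at least 2.

```c
/* Rotate the low `bits` bits of index i left by one position
   (bits must be at least 1 and less than the width of unsigned). */
unsigned rotl_index(unsigned i, unsigned bits) {
    unsigned mask = (1u << bits) - 1u;
    return ((i << 1) | (i >> (bits - 1))) & mask;
}

/* Outer perfect shuffle of a[0 .. n-1] into b, where n = 2**log2n.
   Element a[i] moves to b[rotl(i)]. */
void outer_shuffle(const int *a, int *b, unsigned log2n) {
    unsigned n = 1u << log2n;
    for (unsigned i = 0; i < n; i++)
        b[rotl_index(i, log2n)] = a[i];
}
```

For the eight-element array 0, 1, ..., 7 this produces 0, 4, 1, 5, 2, 6, 3, 7,
which interleaves the two halves and leaves the outer elements in place.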


7-9 An LRU Algorithm 


Ever wonder how your computer keeps track of which cache line 
is the least recently used? Here we describe one such algorithm, 
known as the reference matrix method. It is primarily a hardware 
algorithm, but it might have application in software. 


We won’t go into a long discussion of the intriguing world of 
caches, but only say that we have in mind the high-speed caches 
that buffer data between a computer’s main memory and the 
processor. These caches may get a request for a word every 
computer cycle, and they should usually respond with the data 
within a cycle or two, so there is not much time for a 
complicated algorithm. 


A cache contains a copy of a subset of the data in main 
memory, and the problem we are addressing is: when a cache 
miss occurs (that is, when a word at a certain address is 
requested and the data at that address are not in the cache), how 
does the computer decide which block (or line, in cache jargon) 


to replace with the requested data? Ideally, it should replace the 
data in the line that will not be referenced for the longest time 
in the future. But we cannot know the future, so we have to 
guess. The best guess over a wide variety of application 
programs seems to be the least recently used (LRU) policy. This 
policy replaces the line that has not been referenced for the 
longest time. 


Caches come in three varieties: direct-mapped, fully associative, 
and set-associative. In a direct-mapped cache, certain bits of the 
address of the load or store instruction directly address a 
particular cache line. When a miss occurs, there is no question as 
to what line to replace—it must be the addressed line. There is 
no need for an LRU or any other guessing policy. 


In a fully associative cache, a block from main memory can 
be placed in any cache line. When a load or store is executed, 
the address is looked up to see if it is in the cache. If not, it is 
necessary to replace the contents of some line. The machine has 
complete flexibility in the choice of line to replace. Several 
strategies have been used (FIFO, random, and LRU are the most 
common) and, as mentioned above, LRU seems to be the one 
that most often results in the lowest miss rate. Unfortunately, 
LRU is the most expensive to implement when there are many 
lines to consider for replacement. 


Often the set-associative organization is chosen. It is a 
compromise between direct-mapped and fully associative. The 
designer decides on the degree of associativity, which is usually 
2, 4, 8, or 16. The cache is divided into a number of "sets," each 
of which contains 2, 4, 8, or 16 lines (typically). The set is 
directly addressed, using certain bits of the load or store address, 
but the line within the set must be looked up. The lookup in the 
set is done much the same as in the case of a fully associative 
cache. Now, when it is necessary to replace a line, the LRU 
algorithm need only determine which of the lines within one set 
is the least recently used, and replace that. 


With this brief background, we can describe the reference 
matrix method. To illustrate, assume the cache is four-way set- 
associative. This means that there are four lines for which we 
wish to keep track of the least recently used (referenced). The 
cache may be fully associative and consist of only four lines, or 
it may be set-associative with four lines per set. 


The reference matrix method employs a square bit matrix of 
dimension equal to the degree of associativity (in principle; we 
will modify this statement later). Each associative set has one 
such matrix. The essence of the method is that when line i is 
referenced, row i of the matrix is set to 1’s, and then column i is 
set to 0's. Figure 7-13 illustrates the changes in the matrix from
an initial state to its configuration after a reference to lines 3, 1, 
0, 2, 0, 3, and 2, in that order. 


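[Figure: eight 4×4 bit matrices, with rows labeled by line number 0 to 3
and columns labeled 0 to 3, showing the initial state (Init) and the state
after each reference to lines 3, 1, 0, 2, 0, 3, and 2.]


FIGURE 7-13. Illustration of the reference matrix method.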


Each matrix has a row containing three 1’s, two 1’s, one 1, 
and no 1’s. The number of the row with no 1’s is the least 
recently used line. The number of the row with one 1 is the next 
least recently used line, and so on. When a cache miss occurs, 
the machine finds the row with all 0's and replaces the
corresponding line. It then records it as the most recently used 
line by setting its row to all 1’s and its column to all 0’s. 
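A software model of the four-way case may help in following the figure. The
sketch below stores each matrix row in one byte (bit j of a row is column
j); the type and function names are hypothetical.

```c
#include <stdint.h>

#define WAYS 4

typedef struct {
    uint8_t row[WAYS];          /* row[i] bit j is matrix element (i, j) */
} RefMatrix;

/* Record a reference to line i: row i := all 1's, then column i := 0's. */
void ref_touch(RefMatrix *m, int i) {
    m->row[i] = (uint8_t)((1u << WAYS) - 1u);
    for (int r = 0; r < WAYS; r++)
        m->row[r] &= (uint8_t)~(1u << i);
}

/* The least recently used line is the one whose row is all 0's.
   With several all-zero rows (e.g., initially) the lowest is returned. */
int ref_lru(const RefMatrix *m) {
    for (int r = 0; r < WAYS; r++)
        if (m->row[r] == 0)
            return r;
    return -1;                  /* not reached for a valid matrix */
}
```

After references to lines 3, 1, 0, 2, 0, 3, and 2 in that order, row 2 holds
three 1's, row 3 two, row 0 one, and row 1 none, so line 1 is the least
recently used.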


Why does this work? Denoting the matrix by M, the reason it
works is that $M_{ij}$ indicates whether or not line i is more recently
used than line j. If $M_{ij} = 1$, line i is more recently used than line
j, and if $M_{ij} = 0$, line i is not more recently used than line j.


Consider an arbitrary 4×4 matrix for which line 2 is
referenced. Then the matrix changes as shown in Figure 7-14.
Setting row i to 1’s (except for the element on the main
diagonal) is recording that line i is more recently used than line
j, for all j ≠ i. Setting column i to 0’s is recording that line j is
not more recently used than line i, for all j. Relations among
cache lines other than i are not changed. When all the lines have
been referenced, all the “more recently used” relations will be
established.


Thus, the reference matrix is antisymmetric and the main
diagonal is always all 0’s. Therefore, only part of the matrix,
either the elements above the main diagonal or those below the
main diagonal, need be stored in the cache. That is what is done
in practice. For an n-way associative set, n(n − 1)/2 memory bits
are required. For n = 4, this is six; for n = 8, it is 28. Twenty-eight
is getting to be a bit large, so the reference matrix method,
and in fact the true LRU policy, is not often used for degrees of
associativity greater than 8. Instead, there are approximate LRU
methods and methods that are not LRU at all.


In software, the LRU policy would probably be implemented 
with a list of the line numbers (either a simple vector or a linked 
list). When line i is referenced, the list is searched for i, and then 
i is moved to the top of the list. The least recently used line 
number then migrates to the bottom of the list. 


That method is relatively slow on references (because of 
rearranging the list), but fast in deciding which line to replace. 
Another method, with the opposite speed characteristics, is to 
have a vector of length equal to the degree of associativity, with 
position i holding both the address that line i holds and its “age” 
(actually “newness”) encoded as an integer. When line i is 
referenced, a single variable that holds the current “age” is 
incremented, and the resulting value is stored in the vector at 
position i. To find the least recently used line, the vector is 
searched for the line with the smallest value of “age.” This 
method fails if the “age” integer overflows. 


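[Figure: a 4×4 bit matrix, with rows labeled by line number 0 to 3 and
columns labeled 0 to 3, shown before (Init) and after a reference to
line 2.]


FIGURE 7-14. One step of the reference matrix method.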


There might be one “age” integer per associative set, or only 
one for the whole cache, or in hardware a cycle counter could be 
used. 


The reference matrix method might be useful in software
when the degree of associativity is small. For example, suppose
an application uses eight-way set-associativity and is to run on a
64-bit machine. Then the reference matrix can be stored in a
single 64-bit register. Let the low-order eight bits of the register
hold row 0 of the matrix, the next eight bits hold row 1, and so
forth. Then when line i is referenced, byte i of the register
should be set to 1’s, and bits i, i + 8, ..., i + 56 should be
cleared. Denoting the register by m, this is accomplished as
shown here.


$$m \leftarrow m \mid (\mathtt{0xFF} \ll (8 \cdot i))$$
$$m \leftarrow m \mathbin{\&} \neg(\mathtt{0x0101010101010101} \ll i)$$


This amounts to five or six instructions, plus a few to load 
constants. To find the least recently used line, search for an all- 
zero byte (see Section 6-1). The advantage of this method over 
the other software methods briefly outlined above is that all the 
work is done in a register. 
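In C, the two register updates and the search for the victim might look as
follows. This is a sketch with hypothetical names; the zero-byte search is a
plain loop rather than the branch-free methods of Section 6-1.

```c
#include <stdint.h>

/* Record a reference to line i (0 to 7): byte i := all 1's, then
   bit i of every byte := 0. */
uint64_t lru8_touch(uint64_t m, unsigned i) {
    m |= (uint64_t)0xFF << (8 * i);
    m &= ~(0x0101010101010101ULL << i);
    return m;
}

/* Return the number of the first all-zero byte (the LRU line),
   or -1 if there is none. */
int lru8_victim(uint64_t m) {
    for (int i = 0; i < 8; i++)
        if (((m >> (8 * i)) & 0xFF) == 0)
            return i;
    return -1;
}
```

For example, starting from m = 0 and touching lines 0 through 7 in order
leaves row i equal to 2**i − 1, so the register holds 0x7F3F1F0F07030100
and line 0 is the victim.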


Exercises 


1. Explain the workings of the second Möbius formula
(Equation (1), page 139).


2. The perfect outer shuffle operation and its inverse employ
the following masks:


m0 = 0x22222222,
m1 = 0x0C0C0C0C,
m2 = 0x00F000F0, and
m3 = 0x0000FF00.


What is a formula for the general case, $m_k$? A formula
might be useful in situations in which an upper bound on
the length of the integers being shuffled is not known in
advance, such as in “bignum” applications.


3. Code a function similar to the compress function of Figure 
7-9 that does the expand operation. 


4. For an n-way set-associative cache, what is the theoretical 
minimum number of bits required to implement the LRU 
policy? Compare that to the number of bits required for 
the reference matrix method, for a few small values of n. 


Chapter 8. Multiplication 


8-1 Multiword Multiplication 


This can be done with, basically, the traditional grade-school 
method. But rather than develop an array of partial products, it 
is more efficient to add each new row, as it is being computed, 
into a row that will become the product. 


If the multiplicand is m words, and the multiplier is n words, 
then the product occupies m + n words (or fewer), whether 
signed or unsigned. 


In applying the grade-school scheme, we would like to treat 
each 32-bit word as a single digit. This works out well if an 
instruction that gives the 64-bit product of two 32-bit integers is 
available. Unfortunately, even if the machine has such an 
instruction, it is not readily accessible from most high-level 
languages. In fact, many modern RISC machines do not have this 
instruction in part because it isn't accessible from high-level 
languages and thus would not be used often. (Another reason is 
that the instruction would be one of a very few that give a two- 
register result.) 

Our procedure is shown in Figure 8-1. It uses halfwords as
the “digits.” Parameter w gets the result, and u and v are the
multiplier and multiplicand, respectively. Each is an array of
halfwords, with the first halfword (w[0], u[0], and v[0]) being
the least significant digit. This is “little-endian” order.
Parameters m and n are the number of halfwords in u and v,
respectively.

The picture below may help in understanding. There is no 
relation between m and n; either may be the larger. 


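$$
\begin{array}{r}
u_{m-1}\; u_{m-2}\; \ldots\; u_1\; u_0 \\
\times\quad v_{n-1}\; \ldots\; v_1\; v_0 \\
\hline
w_{m+n-1}\; w_{m+n-2}\; \ldots\; w_1\; w_0
\end{array}
$$


The procedure follows Algorithm M of [Knu2, 4.3.1] but is
coded in C and modified to perform signed multiplication.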


Observe that the assignment to t in the upper half of Figure 8-1
cannot overflow, because the maximum value that could be
assigned to t is $(2^{16} - 1)^2 + 2(2^{16} - 1) = 2^{32} - 1$.


Multiword multiplication is simplest for unsigned operands. 
In fact, the code of Figure 8-1 performs unsigned multiplication 
if the "correction" steps (the lines between the three-line 
comment and the "return" statement) are omitted. An unsigned 
version can be extended to signed in three ways: 


1. Take the absolute value of each input operand, perform 
unsigned multiplication, and then negate the result if the 
input operands had different signs. 


2. Perform the multiplication using unsigned elementary
multiplication, except when multiplying one of the high-order
halfwords, in which case use signed × unsigned or
signed × signed multiplication.


3. Perform unsigned multiplication and then correct the 
result somehow. 


void mulmns(unsigned short w[], unsigned short u[],
   unsigned short v[], int m, int n) {

   unsigned int k, t, b;
   int i, j;

   for (i = 0; i < m; i++)
      w[i] = 0;

   for (j = 0; j < n; j++) {
      k = 0;                    // Carry into this row.
      for (i = 0; i < m; i++) {
         // The cast keeps the product in unsigned arithmetic;
         // without it the halfwords are promoted to (signed) int.
         t = (unsigned)u[i]*v[j] + w[i + j] + k;
         w[i + j] = t;          // (I.e., t & 0xFFFF).
         k = t >> 16;
      }
      w[j + m] = k;
   }

   // Now w[] has the unsigned product. Correct by
   // subtracting v*2**16m if u < 0, and
   // subtracting u*2**16n if v < 0.

   if ((short)u[m - 1] < 0) {
      b = 0;                    // Initialize borrow.
      for (j = 0; j < n; j++) {
         t = w[j + m] - v[j] - b;
         w[j + m] = t;
         b = t >> 31;
      }
   }
   if ((short)v[n - 1] < 0) {
      b = 0;
      for (i = 0; i < m; i++) {
         t = w[i + n] - u[i] - b;
         w[i + n] = t;
         b = t >> 31;
      }
   }
   return;
}


FIGURE 8-1. Multiword integer multiplication, signed. 


The first method requires passing over as many as m + n 
input halfwords to compute their absolute value. Or, if one 
operand is positive and one is negative, the method requires 
passing over as many as max(m, n) + m + n halfwords to 
complement the negative input operand and the result. Perhaps 
more serious, the algorithm would alter its inputs (which we 
assume are passed by address), which may be unacceptable in 
some applications. Alternatively, it could allocate temporary 
space for them, or it could alter them and later change them 
back. All these alternatives are unappealing. 


The second method requires three kinds of elementary
multiplication (unsigned × unsigned, unsigned × signed, and
signed × signed) and requires sign extension of partial products
on the left, with 0's or 1’s, making each partial product take
longer to compute and add to the running total.


We choose the third method. To see how it works, let u and v
denote the values of the two signed integers being multiplied,
and let them be of lengths M and N bits, respectively. Then the
steps in the upper half of Figure 8-1 erroneously interpret u as
an unsigned quantity, having value $u + 2^M u_{M-1}$, where $u_{M-1}$
is the sign bit of u. That is, $u_{M-1} = 1$ if u is negative, and
$u_{M-1} = 0$ otherwise. Similarly, the program interprets v as having
value $v + 2^N v_{N-1}$.


The program computes the product of these unsigned
numbers; that is, it computes


$$(u + 2^M u_{M-1})(v + 2^N v_{N-1}) = uv + 2^M u_{M-1} v + 2^N v_{N-1} u + 2^{M+N} u_{M-1} v_{N-1}.$$


To get the desired result (uv), we must subtract from the
unsigned product the value $2^M u_{M-1} v + 2^N v_{N-1} u$. There is no
need to subtract the term $2^{M+N} u_{M-1} v_{N-1}$, because we know
that the result can be expressed in M + N bits, so there is no
need to compute any product bits more significant than bit
position M + N − 1. These two subtractions are performed by
the steps below the three-line comment in Figure 8-1. They
require passing over a maximum of m + n halfwords.
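The identity is easy to check in the smallest case, M = N = 16, where the
whole product fits in one 32-bit word. The function below is a hypothetical
illustration rather than part of Figure 8-1; it assumes the usual
two's-complement conversion from uint32_t to int32_t.

```c
#include <stdint.h>

/* Signed 16 x 16 ==> 32 product obtained from the unsigned product
   by the two conditional subtractions derived above. */
int32_t signed_from_unsigned_product(int16_t u, int16_t v) {
    uint32_t uu = (uint16_t)u;            /* u + 2**16 * u15 */
    uint32_t vv = (uint16_t)v;            /* v + 2**16 * v15 */
    uint32_t p  = uu * vv;                /* unsigned product */
    if (u < 0) p -= vv << 16;             /* subtract 2**16 * v */
    if (v < 0) p -= uu << 16;             /* subtract 2**16 * u */
    return (int32_t)p;                    /* arithmetic is modulo 2**32 */
}
```

Subtracting the unsigned values vv and uu instead of the signed v and u
changes the result only by multiples of 2**32, which vanish in the modular
arithmetic. This is the same reason the 2**(M+N) term can be ignored.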


It might be tempting to use the program of Figure 8-1 by
passing it an array of fullword integers; that is, by “lying across
the interface.” Such a program will work on a little-endian
machine, but not on a big-endian one. If we had stored the
arrays in the reverse order, with u[0] being the most significant
halfword (and the program altered accordingly), the “lying”
program would work on a big-endian machine, but not on a
little-endian one.


8-2 High-Order Half of 64-Bit Product 


Here we consider the problem of computing the high-order 32
bits of the product of two 32-bit integers. This is the function of
our basic RISC instructions multiply high signed (mulhs) and
multiply high unsigned (mulhu).


For unsigned multiplication, the algorithm in the upper half
of Figure 8-1 works well. Rewrite it for the special case m = n
= 2, with loops unrolled, obvious simplifications made, and the
parameters changed to 32-bit unsigned integers.


For signed multiplication, it is not necessary to code the 
"correction steps" in the lower half of Figure 8-1. These can be 
omitted if proper attention is paid to whether the intermediate 
results are signed or unsigned (declaring them to be signed 
causes the right shifts to be sign-propagating shifts). The 
resulting algorithm is shown in Figure 8-2. For an unsigned 
version, simply change all the int declarations to unsigned. 


The algorithm requires 16 basic RISC instructions in either 
the signed or unsigned version, four of which are 


multiplications. 
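For reference, the unsigned variant obtained by changing the declarations
looks as follows. This is a sketch that uses uint32_t from <stdint.h> in
place of plain unsigned, an assumption made so the halfword arithmetic does
not depend on the width of int.

```c
#include <stdint.h>

/* High-order 32 bits of the 64-bit product of two unsigned 32-bit
   integers, using four 16 x 16 ==> 32 multiplications. */
uint32_t mulhu(uint32_t u, uint32_t v) {
    uint32_t u0 = u & 0xFFFF, u1 = u >> 16;   /* low and high halfwords */
    uint32_t v0 = v & 0xFFFF, v1 = v >> 16;
    uint32_t w0 = u0 * v0;
    uint32_t t  = u1 * v0 + (w0 >> 16);       /* cannot overflow */
    uint32_t w1 = t & 0xFFFF;
    uint32_t w2 = t >> 16;
    w1 = u0 * v1 + w1;                        /* cannot overflow */
    return u1 * v1 + w2 + (w1 >> 16);
}
```

On a compiler with a 64-bit integer type, the result can be compared with
(uint32_t)(((uint64_t)u * v) >> 32).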


int mulhs(int u, int v) {
   unsigned u0, v0, w0;
   int u1, v1, w1, w2, t;

   u0 = u & 0xFFFF;  u1 = u >> 16;   // Low and high halfwords.
   v0 = v & 0xFFFF;  v1 = v >> 16;
   w0 = u0*v0;
   t  = u1*v0 + (w0 >> 16);
   w1 = t & 0xFFFF;
   w2 = t >> 16;                     // Sign-propagating shift.
   w1 = u0*v1 + w1;
   return u1*v1 + w2 + (w1 >> 16);
}

FIGURE 8-2. Multiply high signed.


8-3 High-Order Product Signed from/to Unsigned


Assume that the machine can readily compute the high-order 
half of the 64-bit product of two unsigned 32-bit integers, but we 
wish to perform the corresponding operation on signed integers. 
We could use the procedure of Figure 8-2, but that requires four 
multiplications; the procedure to be given [BGN] is much more 
efficient than that. 


The analysis is a special case of that done to convert Knuth’s
Algorithm M from an unsigned to a signed multiplication routine
(Figure 8-1). Let x and y denote the two 32-bit signed integers
that we wish to multiply together. The machine will interpret x
as an unsigned integer, having the value $x + 2^{32} x_{31}$, where $x_{31}$
is the most significant bit of x (that is, $x_{31}$ is the integer 1 if x is
negative, and 0 otherwise). Similarly, y under unsigned
interpretation has the value $y + 2^{32} y_{31}$.


Although the result we want is the high-order 32 bits of xy,
the machine computes


$$(x + 2^{32} x_{31})(y + 2^{32} y_{31}) = xy + 2^{32}(x_{31} y + y_{31} x) + 2^{64} x_{31} y_{31}.$$


To get the desired result, we must subtract from this the
quantity $2^{32}(x_{31} y + y_{31} x) + 2^{64} x_{31} y_{31}$. Because we know that
the result can be expressed in 64 bits, we can perform the
arithmetic modulo $2^{64}$. This means that we can safely ignore the
last term, and compute the signed high-order product as shown
below (seven basic RISC instructions).


$$
\begin{aligned}
p   &\leftarrow \mathrm{mulhu}(x, y)                     && \text{// multiply high unsigned instruction.} \\
t_1 &\leftarrow (x \overset{s}{\gg} 31) \mathbin{\&} y   && \text{// } t_1 = x_{31}\, y. \\
t_2 &\leftarrow (y \overset{s}{\gg} 31) \mathbin{\&} x   && \text{// } t_2 = y_{31}\, x. \\
p   &\leftarrow p - t_1 - t_2                            && \text{// } p = \text{desired result.}
\end{aligned}
\tag{1}
$$
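A C rendering of (1) is sketched below. Two things here are assumptions made
for the illustration: mulhu is modeled with a 64-bit multiply, standing in
for the machine instruction, and the sign masks are formed with unsigned
arithmetic rather than a signed right shift, so the code does not depend on
how the compiler shifts negative values.

```c
#include <stdint.h>

/* Stand-in for a multiply high unsigned instruction. */
uint32_t mulhu(uint32_t x, uint32_t y) {
    return (uint32_t)(((uint64_t)x * y) >> 32);
}

/* High-order 32 bits of the signed 64-bit product, from mulhu. */
int32_t mulhs_via_mulhu(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t p  = mulhu(ux, uy);
    uint32_t t1 = (0u - (ux >> 31)) & uy;   /* t1 = x31 * y */
    uint32_t t2 = (0u - (uy >> 31)) & ux;   /* t2 = y31 * x */
    return (int32_t)(p - t1 - t2);          /* modulo 2**32 */
}
```

For x = y = −1, mulhu gives 0xFFFFFFFE and both corrections are 0xFFFFFFFF,
so the result is 0, the high half of the product 1.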


Unsigned from Signed 


The reverse transformation follows easily. The resulting program
is the same as (1), except with the first instruction changed to
multiply high signed and the last operation changed to
$p \leftarrow p + t_1 + t_2$.


8-4 Multiplication by Constants 


It is nearly a triviality that one can multiply by a constant with a 
sequence of shift left and add instructions. For example, to 
multiply x by 13 (binary 1101), one can code 


$t_1 \leftarrow x \ll 2$
$t_2 \leftarrow x \ll 3$
$r \leftarrow t_1 + t_2 + x$

where r gets the result.


In this section, left shifts are denoted by multiplication by a
power of 2, so the above plan is written $r \leftarrow 8x + 4x + x$,
which is intended to show four instructions on the basic RISC
and most machines.


What we want to convey here is that there is more to this 
subject than meets the eye. First of all, there are other 
considerations besides simply the number of shifts and add's 
required to do a multiplication by a given constant. To illustrate, 
below are two plans for multiplying by 45 (binary 101101). 


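$t \leftarrow 4x$            $t_1 \leftarrow 4x$
$r \leftarrow x + t$         $t_2 \leftarrow 8x$
$t \leftarrow 2t$            $t_3 \leftarrow 32x$
$r \leftarrow r + t$         $r \leftarrow t_1 + x$
$t \leftarrow 4t$            $t_2 \leftarrow t_2 + t_3$
$r \leftarrow r + t$         $r \leftarrow r + t_2$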


The plan on the left uses a variable t that holds x shifted left 
by a number of positions that corresponds to a 1-bit in the 
multiplier. Each shifted value is obtained from the one before it. 
This plan has these advantages: 


* It requires only one working register other than the input 
x and the output r. 


* Except for the first two, it uses only 2-address 
instructions. 


* The shift amounts are relatively small. 


The same properties are retained when the plan is applied to any 
multiplier. 


The scheme on the right does all the shift’s first, with x as the 
operand. It has the advantage of increased parallelism. On a 
machine with sufficient instruction-level parallelism, the scheme 
on the right executes in three cycles, whereas the scheme on the 
left, running on a machine with unlimited parallelism, requires 
four. 
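
As a rough C sketch of the two plans for multiplying by 45 (the
function names mul45_serial and mul45_parallel are made up for this
illustration, and a compiler is of course free to reorder or rewrite
either one):

```c
#include <stdint.h>

/* Left-hand plan: one temporary, each shifted value derived from the last. */
uint32_t mul45_serial(uint32_t x) {
    uint32_t t, r;
    t = x << 2;      /* t = 4x  */
    r = x + t;       /* r = 5x  */
    t = t << 1;      /* t = 8x  */
    r = r + t;       /* r = 13x */
    t = t << 2;      /* t = 32x */
    r = r + t;       /* r = 45x */
    return r;
}

/* Right-hand plan: all shifts first, each with x as the operand. */
uint32_t mul45_parallel(uint32_t x) {
    uint32_t t1 = x << 2;    /* 4x  */
    uint32_t t2 = x << 3;    /* 8x  */
    uint32_t t3 = x << 5;    /* 32x */
    uint32_t r  = t1 + x;    /* 5x  */
    t2 = t2 + t3;            /* 40x */
    return r + t2;           /* 45x */
}
```

In the second function the three shifts are mutually independent, as
are the two additions that follow them, which is where the
three-cycle schedule comes from.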


In addition to these details, it is nontrivial to find the 
minimum number of operations to accomplish multiplication by 
a constant, where by an "operation" we mean an instruction 
from a typical computer's set of add and shift instructions. In 
what follows, we assume this set consists of add, subtract, shift 
left by any constant amount, and negate. We assume the 
instruction format is three-address. However, the problem is no 
easier if one is restricted to only add (adding a number to itself, 
and then adding the sum to itself, and so on, accomplishes a 
shift left of any amount), or if one augments the set by 
instructions that combine a left shift and an add into one 
instruction (that is, such an instruction computes
$z \leftarrow x + (y \ll n)$). We also assume that only the
least-significant 32 bits of the product are wanted.


The first improvement to the basic binary decomposition 
scheme suggested above is to use subtract to shorten the 
sequence when the multiplier contains a group of three or more 
consecutive 1-bits. For example, to multiply by 28 (binary 
11100), we can compute 32x - 4x (three instructions) rather 
than 16x + 8x + 4x (five instructions). On two’s-complement
machines, the result is correct (modulo $2^{32}$) even if the
intermediate result of 32x overflows.


To multiply by a constant m with the basic binary
decomposition scheme (using only shift's and add's) requires

$2\,\mathrm{pop}(m) - 1 - \delta$

instructions, where $\delta = 1$ if m ends in a 1-bit (is odd), and
$\delta = 0$ otherwise. If subtract is also used, it requires

$4g(m) + 2s(m) - 1 - \delta$

instructions, where g(m) is the number of groups of two or more
consecutive 1-bits in m, s(m) is the number of “singleton” 1-bits
in m, and $\delta$ has the same meaning as before.


For a group of size 2, it makes no difference which method is 
used. 
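
The two counting formulas can be evaluated mechanically. The sketch
below is an illustration only; the function names (pop_count,
cost_shift_add, cost_with_subtract) are hypothetical, and m is
assumed to be at least 1.

```c
#include <stdint.h>

/* Number of 1-bits in m. */
int pop_count(uint32_t m) {
    int c = 0;
    while (m != 0) { m &= m - 1; c++; }
    return c;
}

/* 2*pop(m) - 1 - delta: binary decomposition with shifts and adds only. */
int cost_shift_add(uint32_t m) {
    int delta = (int)(m & 1);
    return 2 * pop_count(m) - 1 - delta;
}

/* 4*g(m) + 2*s(m) - 1 - delta: subtract is used for each group of
   two or more consecutive 1-bits. */
int cost_with_subtract(uint32_t m) {
    int g = 0, s = 0, delta = (int)(m & 1);
    while (m != 0) {
        if (m & 1) {
            int run = 0;                 /* length of this run of 1-bits */
            while (m & 1) { run++; m >>= 1; }
            if (run >= 2) g++; else s++;
        } else {
            m >>= 1;
        }
    }
    return 4 * g + 2 * s - 1 - delta;
}
```

For m = 28 these give five and three instructions respectively,
matching the counts quoted for 16x + 8x + 4x and 32x - 4x.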


The second improvement is to treat specially groups that are
separated by a single 0-bit. For example, consider m = 55
(binary 110111). The group method calculates this as
(64x - 16x) + (8x - x), which requires six instructions.
Calculating it as 64x - 8x - x, however, requires only four.
Similarly, we can multiply by binary 110111011 as illustrated by
the formula 512x - 64x - 4x - x (six instructions).


The formulas above give an upper bound on the number of
operations required to multiply a variable x by any given
number m. Another bound can be obtained based on the size of
m in bits, that is, on $n = \lfloor \log_2 m \rfloor + 1$.

THEOREM. Multiplication of a variable x by an n-bit constant m,
$m \ge 1$, can be accomplished with at most n instructions of the
type add, subtract, and shift left by any given amount.


Proof. (Induction on n.) Multiplication by 1 can be done in 0
instructions, so the theorem holds for n = 1. For n > 1, if m
ends in a 0-bit, then multiplication by m can be accomplished by
multiplying by the number consisting of the left n - 1 bits of m
(that is, by m / 2), in n - 1 instructions, followed by a shift
left of the result by one position. This uses n instructions
altogether.


If m ends in binary 01, then mx can be calculated by
multiplying x by the number consisting of the left n - 2 bits of
m, in n - 2 instructions, followed by a left shift of the result
by 2, and an add of x. This requires n instructions altogether.


If m ends in binary 11, then consider the cases in which it
ends in 0011, 0111, 1011, and 1111. Let t be the result of
multiplying x by the left n - 4 bits of m. If m ends in 0011, then
mx = 16t + 2x + x, which requires (n - 4) + 4 = n
instructions. If m ends in 0111, then mx = 16t + 8x - x, which
requires n instructions. If m ends in 1111, then mx = 16t + 16x
- x, which requires n instructions. The remaining case is that m
ends in 1011.


It is easy to show that mx can be calculated in n instructions 
if m ends in 001011, 011011, or 111011. The remaining case is 
101011. 


This reasoning can be continued, with the “remaining case” 
always being of the form 101010...10101011. Eventually, the 
size of m will be reached, and the only remaining case is the 
number 101010...10101011. This n-bit number contains n / 2 + 
1 1-bits. By a previous observation, it can multiply x with 2(n / 
2 + 1)-2 = n instructions. 


Thus, in particular, multiplication by any 32-bit constant can 
be done in at most 32 instructions, by the method described 
above. By inspection, it is easily seen that for n even, the n-bit 
number 101010...101011 requires n instructions, and for n odd, 
the n-bit number 1010101...010110 also requires n instructions, 
so the bound is tight. 


The methodology described so far is not difficult to work out 
by hand or to incorporate into an algorithm such as might be 
used in a compiler; but such an algorithm would not always 
produce the best code, because further improvement is 
sometimes possible. This can result from factoring the multiplier 
m or some intermediate quantity along the way of computing 
mx. For example, consider again m = 45 (binary 101101). The
methods described above require six instructions. Factoring 45
as 5 · 9, however, gives a four-instruction solution:


$t \leftarrow 4x + x$
$r \leftarrow 8t + t$


Factoring can be combined with the binary decomposition
methods. For example, multiplication by 106 (binary 1101010)
requires seven instructions by binary decomposition, but writing
it as 7 · 15 + 1 leads to a five-instruction solution. For large
constants, the smallest number of instructions that accomplish
the multiplication may be substantially fewer than the number
obtained by the simple binary decomposition methods described.
For example, m = 0xAAAAAAAB requires 32 instructions by
binary decomposition, but writing this value as
2 · 5 · 17 · 257 · 65537 + 1 gives a ten-instruction solution.
(Ten instructions is probably not typical of large numbers. The
factorization reflects the simple bit pattern of alternate 1’s
and 0’s.)
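
The factored plans mentioned here can be written out in C as a quick
check. This is a sketch with made-up function names (mul45_factored,
mul106, mulAAAAAAAB); each C statement of the form (t << k) ± t
stands for one shift and one add or subtract, and all arithmetic is
modulo $2^{32}$.

```c
#include <stdint.h>

/* 45 = 5 * 9: four instructions. */
uint32_t mul45_factored(uint32_t x) {
    uint32_t t = (x << 2) + x;      /* t = 5x       */
    return (t << 3) + t;            /* r = 9t = 45x */
}

/* 106 = 7 * 15 + 1: five instructions. */
uint32_t mul106(uint32_t x) {
    uint32_t t = (x << 3) - x;      /* t = 7x   */
    t = (t << 4) - t;               /* t = 105x */
    return t + x;                   /* r = 106x */
}

/* 0xAAAAAAAB = 2 * 5 * 17 * 257 * 65537 + 1: ten instructions. */
uint32_t mulAAAAAAAB(uint32_t x) {
    uint32_t t = (x << 2) + x;      /* 5x           */
    t = (t << 4) + t;               /* * 17         */
    t = (t << 8) + t;               /* * 257        */
    t = (t << 16) + t;              /* * 65537, giving 0x55555555 * x */
    return (t << 1) + x;            /* 2t + x       */
}
```

Since 3 · 0xAAAAAAAB = $2^{33} + 1$, multiplying 3 by this constant
yields 1 modulo $2^{32}$, which makes a convenient spot check.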


There does not seem to be a simple formula or procedure that 
determines the smallest number of shift and add instructions that 
accomplishes multiplication by a given constant m. A practical 
search procedure is given in [Bern], but it does not always find 
the minimum. Exhaustive search methods to find the minimum 
can be devised, but they are quite expensive in either space or 
time. (See, for example, the tree structure of Figure 15 in [Knu2, 
4.6.3].) 


This should give an idea of the combinatorics involved in this
seemingly simple problem. Knuth [Knu2, 4.6.3] discusses the
closely related problem of computing $a^m$ using a minimum
number of multiplications. This is analogous to the problem of
multiplying by m using only addition instructions.


Exercises 


1. Show that for a $32 \times 32 \Rightarrow 64$ bit multiplication,
the low-order 32 bits of the product are the same whether the
operands are interpreted as signed or unsigned integers.


2. Show how to modify the mulhs function (Figure 8-2) so
that it calculates the low-order half of the 64-bit product,
as well as the high-order half. (Just show the calculation,
not the parameter passing.)


3. Multiplication of complex numbers is defined by 


$(a + bi)(c + di) = ac - bd + (ad + bc)i.$


This can be done with only three multiplications.1 Let


$p = ac$,
$q = bd$, and
$r = (a + b)(c + d)$.


Then the product is given by

$p - q + (r - p - q)i$,


which the reader can easily verify. 


Code a similar method to obtain the 64-bit product of 
two 32-bit unsigned integers using only three 
multiplication instructions. Assume the machine's multiply 
instruction produces the 32 low-order bits of the product 
of two 32-bit integers (which are the same for signed and 
unsigned multiplication). 


Chapter 9. Integer Division 


9-1 Preliminaries 


This chapter and the following one give a number of tricks and
algorithms involving “computer division” of integers. In
mathematical formulas we use the expression x / y to denote
ordinary rational division, $x \div y$ to denote signed computer
division of integers (truncating toward 0), and
$x \overset{u}{\div} y$ to denote unsigned computer division of
integers. Within C code, x/y, of course, denotes computer
division, unsigned if either operand is unsigned, and signed if
both operands are signed.


Division is a complex process, and the algorithms involving it 
are often not very elegant. It is even a matter of judgment as to 
just how signed integer division should be defined. Most high- 
level languages and most computer instruction sets define the 
result to be the rational result truncated toward 0. This and two 
other possibilities are illustrated below. 




                           truncating     modulus      floor

$7 \div 3$        =     2 rem  1     2 rem 1     2 rem  1
$(-7) \div 3$     =    -2 rem -1    -3 rem 2    -3 rem  2
$7 \div (-3)$     =    -2 rem  1    -2 rem 1    -3 rem -2
$(-7) \div (-3)$  =     2 rem -1     3 rem 2     2 rem -1


The relation dividend = quotient * divisor + remainder holds 
for all three possibilities. We define “modulus” division by 
requiring that the remainder be nonnegative.1 We define “floor” 
division by requiring that the quotient be the floor of the 
rational result. For positive divisors, modulus and floor division 
are equivalent. A fourth possibility, seldom used, rounds the 
quotient to the nearest integer. 
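
The three definitions can be expressed in terms of C's built-in
operators, which truncate toward 0 (as of C99). The sketch below is
illustrative only: the type qr_t and the function names are
hypothetical, d is assumed to be nonzero, and the overflow case of
the most negative number divided by -1 is not handled.

```c
typedef struct { int q, r; } qr_t;

/* Truncating division: what C's / and % compute. */
qr_t div_truncating(int n, int d) {
    qr_t x = { n / d, n % d };
    return x;
}

/* Floor division: the quotient is the floor of n/d, so a nonzero
   remainder takes the sign of the divisor. */
qr_t div_floor(int n, int d) {
    qr_t x = { n / d, n % d };
    if (x.r != 0 && ((x.r < 0) != (d < 0))) { x.q -= 1; x.r += d; }
    return x;
}

/* Modulus division: the remainder is always nonnegative. */
qr_t div_modulus(int n, int d) {
    qr_t x = { n / d, n % d };
    if (x.r < 0) {
        if (d > 0) { x.q -= 1; x.r += d; }
        else       { x.q += 1; x.r -= d; }
    }
    return x;
}
```

In every case the adjusted pair still satisfies
dividend = quotient * divisor + remainder.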


One advantage of modulus and floor division is that most of
the tricks simplify. For example, division by $2^n$ can be replaced
by a shift right signed of n positions, and the remainder of
dividing x by $2^n$ is given by the logical and of x and $2^n - 1$.
I suspect that modulus and floor division more often give the
result you want. For example, suppose you are writing a
program to graph an integer-valued function, and the values
range from imin to imax. You want to set up the extremes of the
ordinate to be the smallest multiples of 10 that include imin and
imax. Then the extreme values are simply
$(\mathit{imin} \div 10) * 10$ and $((\mathit{imax} + 9) \div 10) * 10$
if modulus or floor division is used. If conventional division is
used, you must evaluate something like:




if (imin >= 0) gmin = (imin/10)*10; 
else gmin = ((imin - 9)/10)*10; 
if (imax >= 0) gmax = ((imax + 9)/10)*10; 
else gmax = (imax/10)*10; 


Besides the quotient being more useful with modulus or floor 
division than with truncating division, we speculate that the 
nonnegative remainder is probably wanted more often than a 
remainder that can be negative. 
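
For comparison, here is how the graph-limit computation might look if
a floor-division helper were available. The helper floor_div10 and
the functions graph_min and graph_max are hypothetical names for this
sketch; the helper is built from C's truncating operators and is
valid only for the positive divisor 10.

```c
/* Floor of n/10, built from C's truncating / and %. */
int floor_div10(int n) {
    int q = n / 10;
    if (n % 10 < 0) q--;     /* truncation rounded up; step back down */
    return q;
}

/* Largest multiple of 10 that is <= imin. */
int graph_min(int imin) { return floor_div10(imin) * 10; }

/* Smallest multiple of 10 that is >= imax. */
int graph_max(int imax) { return floor_div10(imax + 9) * 10; }
```

With floor division the same expression serves for both signs of the
argument, which is the simplification the text describes.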


It is hard to choose between modulus and floor division, 
because they differ only when the divisor is negative, which is 
unusual. Appealing to existing high-level languages does not 
help, because they almost universally use truncating division for 
x/y when the operands are signed integers. A few give floating- 
point numbers, or rational numbers, for the result. Looking at 
remainders, there is confusion. In Fortran 90, the MOD function
gives the remainder of truncating division and MODULO gives the
remainder of floor division (which can be negative). Similarly, in
Common Lisp and ADA, REM is the remainder of truncating
division, and MOD is the remainder of floor division. In PL/I,
MOD is always nonnegative (it is the remainder of modulus
division). In Pascal, A mod B is defined only for B > 0, and then
it is the nonnegative value (the remainder of either modulus or
floor division).


Anyway, we cannot change the world even if we knew how
we wanted to change it,2 so in what follows we will use the
usual definition (truncating) for $x \div y$.


A nice property of truncating division is that it satisfies

$(-n) \div d = n \div (-d) = -(n \div d)$, for $d \ne 0$.


Care must be exercised when applying this to transform
programs, because if n or d is the maximum negative number, -n
or -d cannot be represented in 32 bits. The operation
$(-2^{31}) \div (-1)$ is an overflow (the result cannot be expressed
as a signed quantity in two's-complement notation), and on most
machines the result is undefined or the operation is suppressed.

Signed integer (truncating) division is related to ordinary
rational division by

$$n \div d = \begin{cases} \lfloor n/d \rfloor, & \text{if } d \ne 0,\ nd \ge 0, \\ \lceil n/d \rceil, & \text{if } d \ne 0,\ nd < 0. \end{cases} \qquad (1)$$

Unsigned integer division (that is, division in which both n and
d are interpreted as unsigned integers) satisfies the upper
portion of (1).

In the discussion that follows, we make use of the following
elementary properties of arithmetic, which we don't prove here.
See [Knu1] and [GKP] for interesting discussions of the floor
and ceiling functions.

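THEOREM D1. For x real, k an integer,


$\lfloor x \rfloor = -\lceil -x \rceil$                  $\lceil x \rceil = -\lfloor -x \rfloor$
$x - 1 < \lfloor x \rfloor \le x$                        $x \le \lceil x \rceil < x + 1$
$\lfloor x \rfloor \le x < \lfloor x \rfloor + 1$        $\lceil x \rceil - 1 < x \le \lceil x \rceil$
$x \ge k \Leftrightarrow \lfloor x \rfloor \ge k$        $x \le k \Leftrightarrow \lceil x \rceil \le k$
$x > k \Rightarrow \lfloor x \rfloor \ge k$              $x < k \Rightarrow \lceil x \rceil \le k$
$x \le k \Rightarrow \lfloor x \rfloor \le k \Leftrightarrow x < k + 1$      $x \ge k \Rightarrow \lceil x \rceil \ge k \Leftrightarrow x > k - 1$
$x < k \Leftrightarrow \lfloor x \rfloor < k$            $x > k \Leftrightarrow \lceil x \rceil > k$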


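THEOREM D2. For n, d integers, d > 0,


$\lfloor n/d \rfloor = \lceil (n - d + 1)/d \rceil$ and $\lceil n/d \rceil = \lfloor (n + d - 1)/d \rfloor$.


If d < 0:


$\lfloor n/d \rfloor = \lceil (n - d - 1)/d \rceil$ and $\lceil n/d \rceil = \lfloor (n + d + 1)/d \rfloor$.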


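THEOREM D3. For x real, d an integer > 0:


$\lfloor \lfloor x \rfloor / d \rfloor = \lfloor x/d \rfloor$ and $\lceil \lceil x \rceil / d \rceil = \lceil x/d \rceil$.


COROLLARY. For a, b real, $b \ne 0$, d an integer > 0,


$\lfloor \lfloor a/b \rfloor / d \rfloor = \lfloor a/(bd) \rfloor$ and $\lceil \lceil a/b \rceil / d \rceil = \lceil a/(bd) \rceil$.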


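THEOREM D4. For n, d integers, $d \ne 0$, and x real,


$\lfloor n/d + x \rfloor = \lfloor n/d \rfloor$ if $0 \le x < |1/d|$, and $\lceil n/d + x \rceil = \lceil n/d \rceil$ if $-|1/d| < x \le 0$.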


In the theorems below, rem(n, d) denotes the remainder of n 
divided by d. For negative d, it is defined by rem(n, -d) = 
rem(n, d), as in truncating and modulus division. We do not use 
rem(n, d) with n < 0. Thus, for our use, the remainder is always 
nonnegative. 


THEOREM D5. For $n \ge 0$, $d \ne 0$,


$\mathrm{rem}(2n, d) = 2\,\mathrm{rem}(n, d)$ or $2\,\mathrm{rem}(n, d) - |d|$, and

$\mathrm{rem}(2n + 1, d) = 2\,\mathrm{rem}(n, d) + 1$ or $2\,\mathrm{rem}(n, d) - |d| + 1$

(whichever value is greater than or equal to 0 and less than |d|).

THEOREM D6. For $n \ge 0$, $d \ne 0$,


$\mathrm{rem}(2n, 2d) = 2\,\mathrm{rem}(n, d)$.


Theorems D5 and D6 are easily proved from the basic
definition of remainder, that is, that for some integer q it
satisfies


$n = qd + \mathrm{rem}(n, d)$ with $0 \le \mathrm{rem}(n, d) < |d|$,


provided $n \ge 0$ and $d \ne 0$ (n and d can be non-integers, but we
will use these theorems only for integers).
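
The remainder identities for doubled arguments can be spot-checked by
brute force over small integers. The following is a hedged sketch,
not a proof; the names rem_nonneg and check_doubling_rules are
invented for the illustration, and the remainder is taken with
respect to |d| so that it is nonnegative for n >= 0, as the text
requires.

```c
/* Remainder as used in the text: n >= 0, d != 0, result in [0, |d|). */
int rem_nonneg(int n, int d) {
    return n % (d < 0 ? -d : d);
}

/* Returns 1 if, for all 0 <= n <= limit and 0 < |d| <= limit:
     rem(2n, d)     is 2rem(n, d)     or 2rem(n, d) - |d|,
     rem(2n + 1, d) is 2rem(n, d) + 1 or 2rem(n, d) - |d| + 1,
     rem(2n, 2d)    is 2rem(n, d).
   Returns 0 at the first counterexample. */
int check_doubling_rules(int limit) {
    for (int d = -limit; d <= limit; d++) {
        if (d == 0) continue;
        int ad = d < 0 ? -d : d;
        for (int n = 0; n <= limit; n++) {
            int r = rem_nonneg(n, d);
            int a = rem_nonneg(2 * n, d);
            int b = rem_nonneg(2 * n + 1, d);
            if (a != 2 * r && a != 2 * r - ad) return 0;
            if (b != 2 * r + 1 && b != 2 * r - ad + 1) return 0;
            if (rem_nonneg(2 * n, 2 * d) != 2 * r) return 0;
        }
    }
    return 1;
}
```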


9-2 Multiword Division 


As in the case of multiword multiplication, multiword division 
can be done by the traditional grade-school method. The details, 
however, are surprisingly complicated. Figure 9-1 is Knuth's 
Algorithm D [Knu2, 4.3.1], coded in C. The underlying form of
division it uses is $32 \overset{u}{\div} 16 \Rightarrow 32$.
(Actually, the quotient of these underlying division operations
is at most 17 bits long.)




int divmnu(unsigned short q[], unsigned short r[],
     const unsigned short u[], const unsigned short v[],
     int m, int n) {

   const unsigned b = 65536; // Number base (16 bits).
   unsigned short *un, *vn;  // Normalized form of u, v.
   unsigned qhat;            // Estimated quotient digit.
   unsigned rhat;            // A remainder.
   unsigned p;               // Product of two digits.
   int s, i, j, t, k;

   if (m < n || n <= 0 || v[n-1] == 0)
      return 1;              // Return if invalid param.

   if (n == 1) {                        // Take care of
      k = 0;                            // the case of a
      for (j = m - 1; j >= 0; j--) {    // single-digit
         q[j] = (k*b + u[j])/v[0];      // divisor here.
         k = (k*b + u[j]) - q[j]*v[0];
      }
      if (r != NULL) r[0] = k;
      return 0;
   }

   // Normalize by shifting v left just enough so that
   // its high-order bit is on, and shift u left the
   // same amount. We may have to append a high-order
   // digit on the dividend; we do that unconditionally.

   s = nlz(v[n-1]) - 16;        // 0 <= s <= 15.
   vn = (unsigned short *)alloca(2*n);
   for (i = n - 1; i > 0; i--)
      vn[i] = (v[i] << s) | (v[i-1] >> (16-s));
   vn[0] = v[0] << s;

   un = (unsigned short *)alloca(2*(m + 1));
   un[m] = u[m-1] >> (16-s);
   for (i = m - 1; i > 0; i--)
      un[i] = (u[i] << s) | (u[i-1] >> (16-s));
   un[0] = u[0] << s;

   for (j = m - n; j >= 0; j--) {       // Main loop.
      // Compute estimate qhat of q[j].
      qhat = (un[j+n]*b + un[j+n-1])/vn[n-1];
      rhat = (un[j+n]*b + un[j+n-1]) - qhat*vn[n-1];
again:
      if (qhat >= b || qhat*vn[n-2] > b*rhat + un[j+n-2])
      { qhat = qhat - 1;
        rhat = rhat + vn[n-1];
        if (rhat < b) goto again;
      }

      // Multiply and subtract.
      k = 0;
      for (i = 0; i < n; i++) {
         p = qhat*vn[i];
         t = un[i+j] - k - (p & 0xFFFF);
         un[i+j] = t;
         k = (p >> 16) - (t >> 16);
      }
      t = un[j+n] - k;
      un[j+n] = t;

      q[j] = qhat;              // Store quotient digit.
      if (t < 0) {              // If we subtracted too
         q[j] = q[j] - 1;       // much, add back.
         k = 0;
         for (i = 0; i < n; i++) {
            t = un[i+j] + vn[i] + k;
            un[i+j] = t;
            k = t >> 16;
         }
         un[j+n] = un[j+n] + k;
      }
   } // End j.

   // If the caller wants the remainder, unnormalize
   // it and pass it back.
   if (r != NULL) {
      for (i = 0; i < n; i++)
         r[i] = (un[i] >> s) | (un[i+1] << (16-s));
   }
   return 0;
}


FIGURE 9-1. Multiword integer division, unsigned. 


The algorithm processes its inputs and outputs a halfword at
a time. Of course, we would prefer to process a fullword at a
time, but it seems that such an algorithm would require an
instruction that does $64 \overset{u}{\div} 32 \Rightarrow 32$
division. We assume here that either the machine does not have
that instruction or it is hard to access from our high-level
language. Although we generally assume the machine has
$32 \overset{u}{\div} 32 \Rightarrow 32$ division, for this problem
$32 \overset{u}{\div} 16 \Rightarrow 16$ suffices.


Thus, for this implementation of Knuth's algorithm, the base 
b is 65536. See [Knu2] for most of the explanation of this 
algorithm. 


The dividend u and the divisor v are in “little-endian” order;
that is, u[0] and v[0] are the least significant digits. (The
code works correctly on both big- and little-endian machines.)
Parameters m and n are the number of halfwords in u and v,
respectively (Knuth defines m to be the length of the quotient).
The caller supplies space for the quotient q and, optionally, for
the remainder r. The space for the quotient must be at least
m - n + 1 halfwords, and for the remainder, n halfwords.
Alternatively, a value of NULL can be given for the address of the
remainder to signify that the remainder is not wanted.


The algorithm requires that the most significant digit of the
divisor, v[n - 1], be nonzero. This simplifies the normalization
steps and helps to ensure that the caller has allocated sufficient
space for the quotient. The code checks that v[n - 1] is
nonzero, and also the requirements that $n \ge 1$ and $m \ge n$. If
any of these conditions are violated, it returns with an error
code (return value 1).


After these checks, the code performs the division for the 
simple case in which the divisor is of length 1. This case is not 
singled out for speed; the rest of the algorithm requires that the 
divisor be of length 2 or more. 


If the divisor is of length 2 or more, the algorithm normalizes 
the divisor by shifting it left just enough so that its high-order 
bit is 1. The dividend is shifted left the same amount, so the 
quotient is not changed by these shifts. As explained by Knuth, 
these steps are necessary to make it easy to guess each quotient 
digit with good accuracy. The number of leading zeros function, 
nlz(x), is used to determine the shift amount. 


In the normalization steps, new space is allocated for the 
normalized dividend and divisor. This is done because it is 
generally undesirable, from the caller’s point of view, to alter 
these input arguments, and because it may be impossible to alter 
them—they may be constants in read-only memory. 
Furthermore, the dividend may need an additional high-order 
digit. C’s “alloca” function is ideal for allocating this space. It is
usually implemented very efficiently, requiring only two or three
in-line instructions to allocate the space and no instructions at
all to free it. The space is allocated on the program's stack, in
such a way that it is freed automatically upon subroutine return.


In the main loop, the quotient digits are cranked out one per
loop iteration, and the dividend is reduced until it becomes the
remainder. The estimate qhat of each quotient digit, after being
refined by the steps in the loop labeled again, is always either
exact or too high by 1.


The next steps multiply qhat by the divisor and subtract the
product from the current remainder, as in the grade-school
method. If the remainder is negative, it is necessary to decrease
the quotient digit by 1 and either re-multiply and subtract or,
more simply, adjust the remainder by adding the divisor to it.
This need be done at most once, because the quotient digit was
either exact or 1 too high.


Lastly, the remainder is given back to the caller if the address 
of where to put it is non-null. The remainder must be shifted 
right by the normalization shift amount s. 


The “add back” steps are executed only rarely. To see this,
observe that the first calculation of each estimated quotient digit
qhat is done by dividing the most significant two digits of the
current remainder by the most significant digit of the divisor.
The steps in the “again” loop amount to refining qhat to be the
result of dividing the most significant three digits of the current
remainder by the most significant two digits of the divisor (proof
omitted; convince yourself of this by trying some examples using
b = 10). Note that the divisor is greater than or equal to b/2
(because of normalization), and the dividend is less than or
equal to b times the divisor (because each remainder is less than
the divisor).


How accurate is the quotient estimated by using only three
dividend digits and two divisor digits? Because normalization
was done, it can be shown to be quite accurate. To see this
somewhat intuitively (not a formal proof), consider estimating u
/ v in this way for base ten arithmetic. It can be shown that the
estimate is always high (or exact). Thus, the worst case occurs if
truncation of the divisor to two digits decreases the divisor by as
much as possible in the sense of relative error, and truncation of
the dividend to three digits decreases it by as little as possible
(which is 0), and if the dividend is as large as possible. This
occurs for the case 49900...0/5099...9, which we estimate by
499/50 = 9.98. The true result is approximately 499/51 =
9.7843. The difference of 0.1957 reveals that the estimated
quotient digit and the true quotient digit, which are the floor
functions of these ratios, will differ by at most 1, and this will
occur about 20% of the time (assuming the quotient digits are
uniformly distributed). This, in turn, means that the “add back”
steps will be executed about 20% of the time.


Carrying out this (non-rigorous) analysis for a general base b
yields the result that the estimated and true quotients differ by
at most 2 / b. For b = 65536, we again obtain the result that the
difference between the estimated and true quotient digits is at
most 1, and this occurs with probability 2/65536 ≈ 0.00003.
Thus, the “add back” steps are executed for only about 0.003%
of the quotient digits.


An example that requires the add back step is, in decimal,
4500/501. A similar example for base 65536 is
0x7FFF 8000 0000 0000/0x8000 0000 0001.


We will not attempt to estimate the running time of this
entire program, but simply note that for large m and n, the
execution time is dominated by the multiply/subtract loop. On a
good compiler this will compile into about 16 basic RISC
instructions, one of which is multiply. The “for j” loop is
executed m - n + 1 times, and the multiply/subtract loop n
times, giving an execution time for this part of the program of
(15 + mul)n(m - n + 1) cycles, where mul is the time to
multiply two 16-bit variables. The program also executes m - n
+ 1 divide instructions and one number of leading zeros
instruction.


Signed Multiword Division 


We do not give an algorithm specifically for signed multiword 
division, but merely point out that the unsigned algorithm can 
be adapted for this purpose as follows: 


1. Negate the dividend if it is negative, and similarly for the 
divisor. 


2. Convert the dividend and divisor to unsigned 
representation. 


3. Use the unsigned multiword division algorithm.


4. Convert the quotient and remainder to signed
representation.


5. Negate the quotient if the dividend and divisor had 
opposite signs. 


6. Negate the remainder if the dividend was negative. 


These steps sometimes require adding or deleting a most
significant digit. For example, assume for simplicity that the
numbers are represented in base 256 (one byte per digit), and
that in the signed representation, the high-order bit of the
sequence of digits is the sign bit. This is much like ordinary
two's-complement representation. Then, a divisor of 255, which
has signed representation 0x00FF, must be shortened in step 2 to
0xFF. Similarly, if the quotient from step 3 begins with a 1-bit, it
must be provided with a leading 0-byte for correct
representation as a signed quantity.


9-3 Unsigned Short Division from Signed Division 


By "short division" we mean the division of one single word by 
another (e.g., $32 \div 32 \Rightarrow 32$). It is the form of division provided
by the “/” operator, when the operands are integers, in C and 
many other high-level languages. C has both signed and 
unsigned short division, but some computers provide only signed 
division in their instruction repertoire. How can you implement 
unsigned division on such a machine? There does not seem to be 
any really slick way to do it, but we offer some possibilities here. 


Using Signed Long Division 


Even if the machine has signed long division
($64 \div 32 \Rightarrow 32$), unsigned short division is not as
simple as you might think. In the XLC compiler for the IBM
RS/6000, it is implemented as illustrated below for
$q \leftarrow n \overset{u}{\div} d$.


if $n \overset{u}{<} d$ then $q \leftarrow 0$
else if $d = 1$ then $q \leftarrow n$
else if $d \le 1$ then $q \leftarrow 1$
else $q \leftarrow (0 \,\|\, n) \div d$


The third line is really testing to see if
$d \overset{u}{\ge} 2^{31}$. If d is algebraically less than or equal
to 1 at this point, then because it is not equal to 1 (from the
second line), it must be algebraically less than or equal to 0. We
don't care about the case d = 0, so for the cases of interest, if
the test on the third line evaluates to true, the sign bit of d is
on, that is, $d \overset{u}{\ge} 2^{31}$. Because from the first
line it is known that $n \overset{u}{\ge} d$, and because n cannot
exceed $2^{32} - 1$, $n \overset{u}{\div} d = 1$.


The notation on the fourth line means to form the double-
length integer consisting of 32 0-bits followed by the 32-bit
quantity n, and divide it by d. The test for d = 1 (second line) is
necessary to ensure that this division does not overflow (it
would overflow if $n \overset{u}{\ge} 2^{31}$, and then the quotient
would be undefined).

By commoning the comparisons on the second and third
lines,3 the above can be implemented in 11 instructions, three of
which are branches. If it is necessary that the divide be executed
when d = 0, to get the overflow interrupt, then the third line
can be changed to “else if $d < 0$ then $q \leftarrow 1$,” giving a
12-instruction solution on the RS/6000.

It is a simple matter to alter the above code so that the
probable usual cases ($2 \le d < 2^{31}$) do not go through so many
tests (begin with if $d \le 1$ ...), but the code volume increases
slightly.
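
In C, the four-way case analysis might be sketched as below. This is
an illustration rather than the XLC code itself: the function name
divu_via_signed_long is hypothetical, the signed long division is
simulated with a 64-bit signed divide, the conversion of d to int32_t
assumes a two's-complement machine, and the case d = 0 is not
treated as an error (it falls into the third test and yields 1).

```c
#include <stdint.h>

/* Unsigned 32-bit division built on a signed 64/32 divide. */
uint32_t divu_via_signed_long(uint32_t n, uint32_t d) {
    int32_t ds = (int32_t)d;          /* d reinterpreted as signed */
    if (n < d)   return 0;            /* unsigned comparison */
    if (ds == 1) return n;            /* avoids quotient overflow */
    if (ds <= 1) return 1;            /* sign bit of d is on: d >= 2^31 */
    /* Here 2 <= d < 2^31, so (0 || n) / d fits in 31 bits. */
    return (uint32_t)((int64_t)n / ds);
}
```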


Using Signed Short Division 


This section is written for a 32-bit machine, but it applies to a
64-bit machine (that is, getting unsigned
$64 \overset{u}{\div} 64 \Rightarrow 64$ division from the same form
of signed division) by changing all occurrences of 31 to 63. It
can be used to get unsigned division in Java, which lacks
unsigned integers.

If signed long division is not available, but signed short
division is, then $n \overset{u}{\div} d$ can be implemented by
somehow reducing the problem to the case $n, d < 2^{31}$ and using
the machine’s divide instruction. If $d \overset{u}{\ge} 2^{31}$,
then $n \overset{u}{\div} d$ can only be 0 or 1, so this case is
easily dispensed with. Then, we can reduce the dividend by using
the fact that the expression
$((n \overset{u}{\div} 2) \div d) \times 2$ approximates
$n \overset{u}{\div} d$ with an error of only 0 or 1. This leads to
the following method:


1. if $d < 0$ then if $n \overset{u}{<} d$ then $q \leftarrow 0$
2.                 else $q \leftarrow 1$
3. else do
4.    $q \leftarrow ((n \overset{u}{\div} 2) \div d) \times 2$
5.    $r \leftarrow n - qd$
6.    if $r \overset{u}{\ge} d$ then $q \leftarrow q + 1$
7. end

The test d < 0 on line 1 is really testing to determine if
$d \overset{u}{\ge} 2^{31}$. If $d \overset{u}{\ge} 2^{31}$, then the
largest the quotient could be is
$(2^{32} - 1) \div 2^{31} = 1$, so the first two lines compute the
correct quotient.


Line 4 represents the code shift right unsigned 1, divide, shift
left 1. Clearly, $n \overset{u}{\div} 2 < 2^{31}$, and at this point
$d \overset{u}{<} 2^{31}$ as well, so these quantities can be used in
the computer's signed division instruction. (If d = 0, overflow
will be signaled here.)


The estimate computed at line 4 is

$$q = \lfloor \lfloor n/2 \rfloor / d \rfloor \times 2 = \lfloor n/(2d) \rfloor \times 2,$$

where we have used the corollary of Theorem D3. Line 5
computes the remainder corresponding to the estimated
quotient. It is

$$r = n - qd = n - \lfloor n/(2d) \rfloor \times 2d = \mathrm{rem}(n, 2d).$$

Thus, $0 \le r < 2d$. If $r < d$, then q is the correct quotient. If
$r \ge d$, then adding 1 to q gives the correct quotient (the
program must use an unsigned comparison here, because of the
possibility that $r \ge 2^{31}$).


By moving the load immediate of 0 into q ahead of the
comparison $n \overset{u}{<} d$, and coding the assignment
$q \leftarrow 1$ in line 2 as a branch to the assignment
$q \leftarrow q + 1$ in line 6, this can be coded in 14 instructions
on most machines, four of which are branches. It is
straightforward to augment the code to produce the remainder
as well: to line 1 append $r \leftarrow n$, to line 2 append
$r \leftarrow n - d$, and to the “then” clause in line 6 append
$r \leftarrow r - d$. (Or, at the cost of a multiply, simply append
$r \leftarrow n - qd$ to the end of the whole sequence.)
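
The seven-line method, with the remainder included, can be sketched
in C as follows. The function name divu_from_signed and its
remainder parameter are assumptions made for this illustration;
signed short division is represented by C's division on int32_t
operands, conversion of d to int32_t assumes two's complement, and
d = 0 is left to the caller to avoid.

```c
#include <stddef.h>
#include <stdint.h>

/* Unsigned 32-bit division using only a signed 32/32 divide.
   If rem is non-null, the remainder is stored through it. */
uint32_t divu_from_signed(uint32_t n, uint32_t d, uint32_t *rem) {
    uint32_t q, r;
    if ((int32_t)d < 0) {                 /* lines 1-2: d >= 2^31 */
        q = (n < d) ? 0 : 1;
        r = (q != 0) ? n - d : n;
    } else {
        /* line 4: shift right unsigned 1, signed divide, shift left 1 */
        q = (uint32_t)((int32_t)(n >> 1) / (int32_t)d) << 1;
        r = n - q * d;                    /* line 5: r = rem(n, 2d) */
        if (r >= d) {                     /* line 6: unsigned comparison */
            q = q + 1;
            r = r - d;
        }
    }
    if (rem != NULL) *rem = r;
    return q;
}
```

The product q * d is computed in unsigned arithmetic, so it wraps
modulo $2^{32}$ exactly as the text's analysis assumes.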


An alternative for lines 1 and 2 is 


if $n \overset{u}{<} d$ then $q \leftarrow 0$
else if $d < 0$ then $q \leftarrow 1$,


which can be coded a little more compactly, for a total of 13
instructions, three of which are branches. But it executes more
instructions in what is probably the usual case (small numbers
with n > d).


Using predicate expressions, the program can be written


1. if $d < 0$ then $q \leftarrow (n \overset{u}{\ge} d)$
2. else do
3.    $q \leftarrow ((n \overset{u}{\div} 2) \div d) \times 2$
4.    $r \leftarrow n - qd$
5.    $q \leftarrow q + (r \overset{u}{\ge} d)$
6. end


which saves two branches if there is a way to evaluate the
predicates without branching. On the basic RISC they can be
evaluated in one instruction (CMPGEU); on MIPS they take two
(SLTU, XORI). On most computers, they can be evaluated in four
instructions each (three if equipped with a full set of logic
instructions), by using the expression for $x \overset{u}{<} y$
given in “Comparison Predicates” on page 23, and simplifying
because on line 1 of the program above it is known that
$d_{31} = 1$, and on line 5 it is known that $d_{31} = 0$. The
expression simplifies to


$n \overset{u}{\ge} d = (n \,\&\, \neg(n - d)) \overset{u}{\gg} 31$ on line 1, and


$r \overset{u}{\ge} d = (r \mid \neg(r - d)) \overset{u}{\gg} 31$ on line 5.

We can get branch-free code by forcing the dividend to be 0
when $d \overset{u}{\ge} 2^{31}$. Then, the divisor can be used in the
machine's signed divide instruction, because when it is
misinterpreted as a negative number, the result is set to 0, which
is within 1 of being correct. We'll still handle the case of a large
dividend by shifting it one position to the right before the
division, and then shifting the quotient one position to the left
after the division. This gives the following program (ten basic
RISC instructions):

1. $t \leftarrow d \overset{s}{\gg} 31$
2. $n' \leftarrow n \mathbin{\&} \lnot t$
3. $q \leftarrow ((n' \overset{u}{\gg} 1) \div d) \ll 1$
4. $r \leftarrow n - qd$
5. $q \leftarrow q + (r \overset{u}{\ge} d)$

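
The five-line branch-free program can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the name divu_via_signed is hypothetical, the machine's signed divide is modeled by C division on int32_t, and the conversion of a value of 2**31 or more to int32_t is assumed to wrap to a negative number (the usual two's-complement behavior, though older C standards leave it implementation-defined).

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit quotient computed with a signed divide only.
   Hypothetical name; mirrors the five-line branch-free program. */
uint32_t divu_via_signed(uint32_t n, uint32_t d) {
    assert(d != 0);                    /* Division by 0 is not handled. */
    uint32_t t  = 0u - (d >> 31);      /* Same value as d >>s 31: all 1's
                                          if d >= 2**31, else 0. */
    uint32_t n1 = n & ~t;              /* Force dividend to 0 if d is large. */
    /* Signed divide; (int32_t)d is assumed to wrap when d >= 2**31. */
    int32_t  qs = (int32_t)(n1 >> 1) / (int32_t)d;
    uint32_t q  = (uint32_t)qs << 1;   /* Estimate: correct or 1 too small. */
    uint32_t r  = n - q * d;           /* Remainder for the estimate
                                          (equals n when q was forced to 0). */
    return q + (r >= d);               /* Unsigned compare adds 0 or 1. */
}
```

For example, divu_via_signed(0xFFFFFFFF, 3) yields 0x55555555, and a divisor of 0x80000000 or more can only yield 0 or 1.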
9-4 Unsigned Long Division 


By "long division" we mean the division of a doubleword by a
single word. For a 32-bit machine, this is
$64 \overset{u}{\div} 32 \Rightarrow 32$ division, with the result
unspecified in the overflow cases, including division by 0.

Some 32-bit machines provide an instruction for unsigned
long division. Its full capability, however, gets little use, because
only $32 \overset{u}{\div} 32 \Rightarrow 32$ division is accessible
with most high-level languages. Therefore, a computer designer might
elect to provide only $32 \overset{u}{\div} 32$ division and would
probably want an estimate of the execution time of a subroutine that
implements the missing function. Here we give two algorithms for
providing this missing function.


Hardware Shift-and-Subtract Algorithms 


As a first attempt at doing long division, we consider doing what 
the hardware does. There are two algorithms commonly used, 
called restoring and nonrestoring division [H&P, sec. A-2; EL]. 
They are both basically “shift-and-subtract” algorithms. In the 
restoring version, shown below, the restoring step consists of 
adding back the divisor when the subtraction gives a negative 
result. Here x, y, and z are held in 32-bit registers. Initially, the 
double-length dividend is x || y, and the divisor is z. We need a 


single-bit register c to hold the overflow from the subtraction. 


do i ← 1 to 32
   c || x || y ← 2(x || y)                    // Shift left one.
   c || x ← (c || x) - (0b0 || z)             // Subtract (33 bits).
   y_0 ← ¬c                                   // Set one bit of quotient.
   if c then c || x ← (c || x) + (0b0 || z)   // Restore.
end


Upon completion, the quotient is in register у and the remainder 
is in register x. 


The algorithm does not give a useful result in the overflow
cases. For division of the doubleword quantity x || y by 0, the
quotient obtained is the one's-complement of x, and the
remainder obtained is y. In particular,
$0 \overset{u}{\div} 0 \Rightarrow 2^{32} - 1$ rem 0.
The other overflow cases are difficult to characterize.

It might be useful if, for nonzero divisors, the algorithm
would give the correct quotient modulo $2^{32}$, and the correct
remainder. The only way to do this seems to be to make the
register represented by c || x || y above 97 bits long, and do the
loop 64 times. This is doing $64 \overset{u}{\div} 32 \Rightarrow 64$
division. The subtractions would still be 33-bit operations, but the
additional hardware and execution time make this refinement probably
not worthwhile.


This algorithm is difficult to implement exactly in software, 
because most machines do not have the 33-bit register that we 
have represented by c || x. Figure 9-2, however, illustrates a 
shift-and-subtract algorithm that reflects the hardware algorithm 
to some extent. 


The variable t is used for a device to make the comparison
come out right. We want to do a 33-bit comparison after shifting
x || y. If the first bit of x is 1 (before the shift), then certainly
the 33-bit quantity is greater than the divisor (32 bits). In this
case, x | t is all 1's, so the comparison gives the correct result
(true). On the other hand, if the first bit of x is 0, then a 32-bit
comparison is sufficient.

The code of the algorithm in Figure 9-2 executes in 321 to
385 basic RISC instructions, depending upon how often the
comparison is true. If the machine has shift left double, the
shifting operation can be done in one instruction, rather than the
four used above. This would reduce the execution time to about
225 to 289 instructions (we are allowing two instructions per
iteration for loop control).

The algorithm in Figure 9-2 can be used to do
$32 \overset{u}{\div} 32 \Rightarrow 32$ division by supplying x = 0.
The only simplification that results is that the variable t can be
omitted, as its value would always be 0.


unsigned divlu(unsigned x, unsigned y, unsigned z) {
   // Divides (x || y) by z.
   int i;
   unsigned t;

   for (i = 1; i <= 32; i++) {
      t = (int)x >> 31;          // All 1's if x(31) = 1.
      x = (x << 1) | (y >> 31);  // Shift x || y left
      y = y << 1;                // one bit.
      if ((x | t) >= z) {
         x = x - z;
         y = y + 1;
      }
   }
   return y;                     // Remainder is x.
}

FIGURE 9-2. Divide long unsigned, shift-and-subtract
algorithm.


Below is the nonrestoring hardware division
algorithm (unsigned). The basic idea is that, after subtracting
the divisor z from the 33-bit quantity that we denote by c || x,
there is no need to add back z if the result was negative. Instead,
it suffices to add on the next iteration rather than subtract. This
is because adding z (to correct the error of having subtracted z
on the previous iteration), shifting left, and subtracting z is
equivalent to adding z ($2(u + z) - z = 2u + z$). The advantage
to hardware is that there is only one add or subtract operation 
on each loop iteration, and the adder is likely to be the slowest 
circuit in the loop.4 An adjustment to the remainder is needed at 
the end if it is negative. (No corresponding adjustment of the 
quotient is required.) 


The input dividend is the doubleword quantity x || y, and the
divisor is z. Upon completion, the quotient is in register y and
the remainder is in register x.

c ← 0
do i ← 1 to 32
   if c = 0 then do
      c || x || y ← 2(x || y)            // Shift left one.
      c || x ← (c || x) - (0b0 || z)     // Subtract divisor.
   end
   else do
      c || x || y ← 2(x || y)            // Shift left one.
      c || x ← (c || x) + (0b0 || z)     // Add divisor.
   end
   y_0 ← ¬c                              // Set one bit of quotient.
end
if c = 1 then x ← x + z                  // Adjust remainder if negative.


This does not seem to adapt very well to a 32-bit algorithm. 
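
One way to experiment with the nonrestoring loop in software is to model the 33-bit register c || x in a wider integer. The sketch below does that with a 64-bit variable masked to 33 bits. The name divlu_nonrestoring, the rem output parameter, and the use of uint64_t are assumptions made for illustration; it is a simulation of the hardware steps, not a true 32-bit adaptation, and it assumes x < z so that the quotient fits in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulates nonrestoring division of (x || y) by z.
   Hypothetical helper; c || x is held in the low 33 bits of cx. */
uint32_t divlu_nonrestoring(uint32_t x, uint32_t y, uint32_t z,
                            uint32_t *rem) {
    const uint64_t MASK33 = (1ULL << 33) - 1;
    uint64_t cx = x;                            /* c || x, with c = 0. */
    int i;

    assert(x < z);                              /* No-overflow precondition. */
    for (i = 1; i <= 32; i++) {
        unsigned c = (unsigned)(cx >> 32);      /* Sign of partial remainder. */
        cx = ((cx << 1) | (y >> 31)) & MASK33;  /* Shift c || x || y left one. */
        y <<= 1;
        if (c == 0) cx = (cx - z) & MASK33;     /* Subtract divisor. */
        else        cx = (cx + z) & MASK33;     /* Add divisor. */
        y |= (uint32_t)(~(cx >> 32) & 1);       /* Quotient bit is not c. */
    }
    if (cx >> 32) cx = (cx + z) & MASK33;       /* Adjust negative remainder. */
    if (rem != NULL) *rem = (uint32_t)cx;
    return y;
}
```

Dividing 7 by 2 this way (x = 0, y = 7, z = 2) leaves a negative partial remainder for the first 30 iterations and still ends with quotient 3 and remainder 1.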


The 801 minicomputer (an early experimental RISC machine 
built by IBM) had a divide step instruction that essentially 
performed the steps in the body of the loop above. It used the 
machine’s carry status bit to hold c and the MQ (a 32-bit 
register) to hold y. A 33-bit adder/subtracter is needed for its 
implementation. The 801’s divide step instruction was a little 
more complicated than the loop above, because it performed 
signed division and it had an overflow check. Using it, a division 
subroutine can be written that consists essentially of 32 
consecutive divide step instructions followed by some 
adjustments to the quotient and remainder to make the 
remainder have the desired sign. 


Using Short Division 


An algorithm for $64 \overset{u}{\div} 32 \Rightarrow 32$ division
can be obtained from the multiword division algorithm of Figure 9-1
on page 185, by specializing it to the case m = 4, n = 2. Several
other changes are necessary. The parameters should be fullwords
passed by value, rather than arrays of halfwords. The overflow
condition is different; it occurs if the quotient cannot be
contained in a single fullword. It turns out that many
simplifications to the routine are possible. It can be shown that
the guess qhat is always exact; it is exact if the divisor consists
of only two halfword digits. This means that the "add back" steps
can be omitted. If the "main loop" of Figure 9-1 and the loop within
it are unrolled, some minor simplifications become possible.


The result of these transformations is shown in Figure 9-3.
The dividend is in u1 and u0, with u1 containing the most
significant word. The divisor is parameter v. The quotient is the
returned value of the function. If the caller provides a non-null
pointer in parameter r, the function will return the remainder in
the word to which r points.


For an overflow indication, the program returns a remainder 
equal to the maximum unsigned integer. This is an impossible 
remainder for a valid division operation, because the remainder 
must be less than the divisor. In the overflow case, the program 
also returns a quotient equal to the maximum unsigned integer, 
which may be an adequate indicator in some cases in which the 
remainder is not wanted. 


The strange expression (-s >> 31) in the assignment to un32
is supplied to make the program work for the case s = 0 on
machines that have mod 32 shifts (e.g., Intel x86).

Experimentation with uniformly distributed random numbers
suggests that the bodies of the "again" loops are each executed
about 0.38 times for each execution of the function. This gives
an execution time, if the remainder is not wanted, of about 52
instructions. Of these instructions, one is number of leading zeros,
two are divide, and 6.5 are multiply (not counting the
multiplications by b, which are shifts). If the remainder is
wanted, add six instructions (counting the store of r), one of
which is multiply.

What about a signed version of divlu? It would probably be
difficult to modify the code of Figure 9-3, step by step, to
produce a signed variant. That algorithm, however, can be used
for signed division by taking the absolute value of the
arguments, running divlu, and then complementing the result if
the signs of the original arguments differ. There is no problem
with extreme values such as the maximum negative number,
because the absolute value of any signed integer has a correct
representation as an unsigned integer. This algorithm is shown
in Figure 9-4.

It is hard to devise really good code to detect overflow in the
signed case. The algorithm shown in Figure 9-4 makes a
preliminary determination identical to that used by the unsigned
long division routine, which ensures that $|u / v| < 2^{32}$. After
that, it is necessary only to ensure that the quotient has the
proper sign or is 0.


unsigned divlu(unsigned u1, unsigned u0, unsigned v,
               unsigned *r) {
   const unsigned b = 65536; // Number base (16 bits).
   unsigned un1, un0,        // Norm. dividend LSD's.
            vn1, vn0,        // Norm. divisor digits.
            q1, q0,          // Quotient digits.
            un32, un21, un10,// Dividend digit pairs.
            rhat;            // A remainder.
   int s;                    // Shift amount for norm.

   if (u1 >= v) {            // If overflow, set rem.
      if (r != NULL)         // to an impossible value,
         *r = 0xFFFFFFFF;    // and return the largest
      return 0xFFFFFFFF;}    // possible quotient.

   s = nlz(v);               // 0 <= s <= 31.
   v = v << s;               // Normalize divisor.
   vn1 = v >> 16;            // Break divisor up into
   vn0 = v & 0xFFFF;         // two 16-bit digits.

   un32 = (u1 << s) | (u0 >> 32 - s) & (-s >> 31);
   un10 = u0 << s;           // Shift dividend left.

   un1 = un10 >> 16;         // Break right half of
   un0 = un10 & 0xFFFF;      // dividend into two digits.

   q1 = un32/vn1;            // Compute the first
   rhat = un32 - q1*vn1;     // quotient digit, q1.
again1:
   if (q1 >= b || q1*vn0 > b*rhat + un1) {
     q1 = q1 - 1;
     rhat = rhat + vn1;
     if (rhat < b) goto again1;}

   un21 = un32*b + un1 - q1*v;  // Multiply and subtract.

   q0 = un21/vn1;            // Compute the second
   rhat = un21 - q0*vn1;     // quotient digit, q0.
again2:
   if (q0 >= b || q0*vn0 > b*rhat + un0) {
     q0 = q0 - 1;
     rhat = rhat + vn1;
     if (rhat < b) goto again2;}

   if (r != NULL)            // If remainder is wanted,
      *r = (un21*b + un0 - q0*v) >> s;     // return it.
   return q1*b + q0;
}

FIGURE 9-3. Divide long unsigned, using fullword division
instruction.


int divls(int u1, unsigned u0, int v, int *r) {
   int q, uneg, vneg, diff, borrow;

   uneg = u1 >> 31;          // -1 if u < 0.
   if (uneg) {               // Compute the absolute
      u0 = -u0;              // value of the dividend u.
      borrow = (u0 != 0);
      u1 = -u1 - borrow;}

   vneg = v >> 31;           // -1 if v < 0.
   v = (v ^ vneg) - vneg;    // Absolute value of v.

   if ((unsigned)u1 >= (unsigned)v) goto overflow;

   q = divlu(u1, u0, v, (unsigned *)r);

   diff = uneg ^ vneg;       // Negate q if signs of
   q = (q ^ diff) - diff;    // u and v differed.
   if (uneg && r != NULL)
      *r = -*r;

   if ((diff ^ q) < 0 && q != 0) {  // If overflow,
overflow:                    // set remainder
      if (r != NULL)         // to an impossible value,
         *r = 0x80000000;    // and return the largest
      q = 0x80000000;}       // possible neg. quotient.
   return q;
}

FIGURE 9-4. Divide long signed, using divide long unsigned.

9-5 Doubleword Division from Long Division 


This section considers how to do $64 \div 64 \Rightarrow 64$ division
from $64 \div 32 \Rightarrow 32$ division, for both the unsigned and
signed cases. The algorithms that follow are most suited to a
machine that has an instruction for long division ($64 \div 32$), at
least for the unsigned case. It is also helpful if the machine has
the number of leading zeros instruction. The machine may have either
32-bit or 64-bit registers, but we will assume that if it has 32-bit
registers, then the compiler implements basic operations such as
adds and shifts on 64-bit operands (the "long long" data type in C).

These functions are known as "__udivdi3" and "__divdi3" in
the GNU C world, and similar names are used here.


Unsigned Doubleword Division 

A procedure for this operation is shown in Figure 9-5.

unsigned long long udivdi3(unsigned long long u,
                           unsigned long long v) {

   unsigned long long u0, u1, v1, q0, q1, k, n;

   if (v >> 32 == 0) {          // If v < 2**32:
      if (u >> 32 < v)          // If u/v cannot overflow,
         return DIVU(u, v)      // just do one division.
            & 0xFFFFFFFF;
      else {                    // If u/v would overflow:
         u1 = u >> 32;          // Break u up into two
         u0 = u & 0xFFFFFFFF;   // halves.
         q1 = DIVU(u1, v)       // First quotient digit.
            & 0xFFFFFFFF;
         k = u1 - q1*v;         // First remainder, < v.
         q0 = DIVU((k << 32) + u0, v) // 2nd quot. digit.
            & 0xFFFFFFFF;
         return (q1 << 32) + q0;
      }
   }
                                // Here v >= 2**32.
   n = nlz64(v);                // 0 <= n <= 31.
   v1 = (v << n) >> 32;         // Normalize the divisor
                                // so its MSB is 1.
   u1 = u >> 1;                 // To ensure no overflow.
   q1 = DIVU(u1, v1)            // Get quotient from
       & 0xFFFFFFFF;            // divide unsigned insn.
   q0 = (q1 << n) >> 31;        // Undo normalization and
                                // division of u by 2.
   if (q0 != 0)                 // Make q0 correct or
      q0 = q0 - 1;              // too small by 1.
   if ((u - q0*v) >= v)
      q0 = q0 + 1;              // Now q0 is correct.
   return q0;
}

FIGURE 9-5. Unsigned doubleword division from long
division.

This code distinguishes three cases: (1) the case in which a 
single execution of the machine's unsigned long division 
instruction (DIVU) can be used, (2) the case in which (1) does 
not apply, but the divisor is a 32-bit quantity, and (3) the cases 
in which the divisor cannot be represented in 32 bits. It is not 
too hard to see that the above code is correct for cases (1) and 
(2). For case (2), think of the grade-school method of doing long 
division. 

Case (3), though, deserves proof, because it is very close to 
not working in some cases. Notice that in this case only a single 
execution of DIVU is needed, but the number of leading zeros and 
multiply operations are needed. 


For the proof, we need these basics (for integer variables):

$$\lfloor \lfloor a/b \rfloor / d \rfloor = \lfloor a/(bd) \rfloor \qquad (2)$$

$$b \lfloor a/b \rfloor = a - \mathrm{rem}(a, b) \qquad (3)$$

From the first line in the section of the procedure of interest
(we assume that $v \ne 0$),

$$0 \le n \le 31.$$

In computing $v_1$, the left shift clearly cannot overflow.
Therefore,

$$v_1 = \lfloor v / 2^{32-n} \rfloor, \quad \text{and}$$

$$u_1 = \lfloor u/2 \rfloor.$$

In computing $q_1$, $u_1$ and $v_1$ are in range for the DIVU
instruction and it cannot overflow. Hence,

$$q_1 = \lfloor u_1 / v_1 \rfloor.$$

In the first computation of $q_0$, the left shift cannot overflow
because $q_1 < 2^{32}$ (because the maximum value of $u_1$ is
$2^{63} - 1$ and the minimum value of $v_1$ is $2^{31}$). Therefore,

$$q_0 = \lfloor q_1 / 2^{31-n} \rfloor.$$

Now, for the main part of the proof, we want to show that

$$\lfloor u/v \rfloor \le q_0 \le \lfloor u/v \rfloor + 1,$$

which is to say, the first computation of $q_0$ is the desired result
or is that plus 1.


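Using Equation (2) twice gives

$$q_0 = \left\lfloor \frac{u}{2^{32-n} v_1} \right\rfloor = \left\lfloor \frac{u}{2^{32-n} \lfloor v / 2^{32-n} \rfloor} \right\rfloor.$$

Using Equation (3) gives

$$q_0 = \left\lfloor \frac{u}{v - \mathrm{rem}(v, 2^{32-n})} \right\rfloor.$$

Using algebra to get this in the form u / v + something:

$$q_0 = \left\lfloor \frac{u}{v} + \frac{u\,\mathrm{rem}(v, 2^{32-n})}{v\,(v - \mathrm{rem}(v, 2^{32-n}))} \right\rfloor.$$

This is of the form

$$\left\lfloor \frac{u}{v} + \delta \right\rfloor,$$

and we will now show that $\delta < 1$.
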

$\delta$ is largest when $\mathrm{rem}(v, 2^{32-n})$ is as large as
possible and, given that, when v is as small as possible. The
maximum value of $\mathrm{rem}(v, 2^{32-n})$ is $2^{32-n} - 1$.
Because of the way n is defined in terms of v, $v \ge 2^{63-n}$.
Thus, the smallest value of v having that remainder is

$$2^{63-n} + 2^{32-n} - 1.$$

Therefore,

$$\delta \le \frac{u(2^{32-n} - 1)}{(2^{63-n} + 2^{32-n} - 1)\,2^{63-n}} < \frac{u(2^{32-n} - 1)}{(2^{63-n})^2}.$$

By inspection, for n in its range of 0 to 31,

$$\delta < \frac{u}{2^{64}}.$$

Since u is at most $2^{64} - 1$, $\delta < 1$. Because
$q_0 = \lfloor u/v + \delta \rfloor$ and $\delta < 1$ (and obviously
$\delta \ge 0$),

$$\lfloor u/v \rfloor \le q_0 \le \lfloor u/v \rfloor + 1.$$
v у 


To correct this result by subtracting 1 when necessary, we
would like to code

    if (u < q0*v) q0 = q0 - 1;

(i.e., if the remainder $u - q_0 v$ is negative, subtract 1 from
$q_0$). However, this doesn't quite work, because $q_0 v$ can
overflow (e.g., for $u = 2^{64} - 1$ and $v = 2^{32} + 3$). Instead,
we subtract 1 from $q_0$, so that it is either correct or too small
by 1. Then $q_0 v$ will not overflow. We must avoid subtracting 1 if
$q_0 = 0$ (if $q_0 = 0$, it is already the correct quotient).

Then the final correction is:

    if ((u - q0*v) >= v) q0 = q0 + 1;

To see that this is a valid computation, we already noted that
$q_0 v$ does not overflow. It is easy to show that

$$0 \le u - q_0 v < 2v.$$

If v is very large ($\ge 2^{63}$), can the subtraction overflow by
trying to produce a result greater than v? No, because $u < 2^{64}$
and $q_0 v \ge 0$.


Incidentally, there are alternatives to the lines

    if (q0 != 0)              // Make q0 correct or
       q0 = q0 - 1;           // too small by 1.

that may be preferable on some machines. One is to replace
them with

    if (q0 == 0) return 0;

Another is to place at the beginning of this section of the
procedure, or at the beginning of the whole procedure, the line

    if (u < v) return 0;      // Avoid a problem later.

These alternatives are preferable if branches are not costly.
The code shown in Figure 9-5 works well if the machine's
comparison instructions produce a 0/1 integer result in a general
register. Then, the compiler can change it to, in effect,

    q0 = q0 - (q0 != 0);

(or you can code it that way if your compiler doesn't do this
optimization). This is just a compare and subtract on such
machines.
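
To exercise case (3) in isolation, here is a minimal sketch of just the v >= 2**32 path, written with the compare-and-subtract form of the first correction. The names udiv64_large_divisor, divu_emulated, and nlz64_loop are hypothetical, and the DIVU instruction is emulated with C's 64-bit division truncated to 32 bits, so the sketch is useful for checking the logic rather than for speed.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the 64/32 => 32 unsigned divide instruction. */
static uint64_t divu_emulated(uint64_t u, uint64_t v) {
    return (u / v) & 0xFFFFFFFFu;
}

/* Number of leading zeros of a nonzero 64-bit value (simple loop). */
static int nlz64_loop(uint64_t x) {
    int n = 0;
    while ((x >> 63) == 0) { x <<= 1; n++; }
    return n;
}

/* Quotient u / v for v >= 2**32, following case (3) of Figure 9-5. */
uint64_t udiv64_large_divisor(uint64_t u, uint64_t v) {
    assert((v >> 32) != 0);                /* Only case (3) is handled. */
    int n = nlz64_loop(v);                 /* 0 <= n <= 31. */
    uint64_t v1 = (v << n) >> 32;          /* Normalized divisor, MSB is 1. */
    uint64_t u1 = u >> 1;                  /* To ensure no overflow. */
    uint64_t q1 = divu_emulated(u1, v1);
    uint64_t q0 = (q1 << n) >> 31;         /* Correct or 1 too large. */
    q0 = q0 - (q0 != 0);                   /* Correct or 1 too small. */
    if ((u - q0 * v) >= v) q0 = q0 + 1;    /* Now q0 is correct. */
    return q0;
}
```

With the overflow example from the text, u = 2**64 - 1 and v = 2**32 + 3, the first estimate is 2**32 - 2, the decrement gives 2**32 - 3, and the final test leaves it unchanged because the remainder is 8.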


Signed Doubleword Division 


In the signed case, there seems to be no better way to do
doubleword division than to divide the absolute values of the
operands, using function udivdi3, and then negate the sign of
the quotient if the operands have different signs. If the machine
has a signed long division instruction, which we designate here
as DIVS, then it may be advantageous to single out the cases in
which DIVS can be used rather than invoking udivdi3. This
presumes that these cases are common. Such a function is shown
in Figure 9-6.

The "#define" in the code in Figure 9-6 uses the GCC facility
of enclosing a compound statement in parentheses to construct
an expression, a facility that most C compilers do not have.
Some other compilers may have llabs(x) as a built-in function.

#define llabs(x) \
({unsigned long long t = (x) >> 63; ((x) ^ t) - t;})

long long divdi3(long long u, long long v) {

   unsigned long long au, av;
   long long q, t;

   au = llabs(u);
   av = llabs(v);
   if (av >> 31 == 0) {         // If |v| < 2**31 and
      if (au < av << 31) {      // |u|/|v| cannot
         q = DIVS(u, v);        // overflow, use DIVS.
         return (q << 32) >> 32;
      }
   }
   q = au/av;                   // Invoke udivdi3.
   t = (u ^ v) >> 63;           // If u, v have different
   return (q ^ t) - t;          // signs, negate q.
}

FIGURE 9-6. Signed doubleword division from unsigned
doubleword division.


The test that v is in range is not precise; it misses the case in
which $v = -2^{31}$. If it is important to use the DIVS instruction in
that case, the test

    if ((v << 32) >> 32 == v) {  // If v is in range and

can be used in place of the third executable line in Figure 9-6
(at a cost of one instruction). Similarly, the test that |u| / |v|
cannot overflow is simplified and a few "corner cases" will be
missed; the code amounts to using $\delta = 0$ in the signed division
overflow test scheme shown in "Division" on page 34.


Exercises 


1. Show that for real x, $\lfloor x \rfloor = -\lceil -x \rceil$.


2. Find branch-free code for computing the quotient and 
remainder of modulus division on a basic RISC that has 
division and remainder instructions for truncating 
division. 

3. Similarly, find branch-free code for computing the 
quotient and remainder of floor division on a basic RISC 
that has division and remainder instructions for 
truncating division. 


4. How would you compute $\lceil n/d \rceil$ for unsigned integers n
and d, $0 \le n \le 2^{32} - 1$ and $1 \le d \le 2^{32} - 1$? Assume
your machine has an unsigned divide instruction that
computes $\lfloor n/d \rfloor$.


5. Theorem D3 states that for x real and d an integer,
$\lfloor \lfloor x \rfloor / d \rfloor = \lfloor x/d \rfloor$. Show
that, more generally, if a function f(x) is (a) continuous,
(b) monotonically increasing, and (c) has the property that if
f(x) is an integer then x is an integer, then
$\lfloor f(\lfloor x \rfloor) \rfloor = \lfloor f(x) \rfloor$ [GKP].


Chapter 10. Integer Division By 
Constants 


On many computers, division is very time consuming and is to 
be avoided when possible. A value of 20 or more elementary add 
times is not uncommon, and the execution time is usually the 
same large value even when the operands are small. This 
chapter gives some methods for avoiding the divide instruction 
when the divisor is a constant. 


10-1 Signed Division by a Known Power of 2 


Apparently, many people have made the mistake of assuming
that a shift right signed of k positions divides a number by $2^k$,
using the usual truncating form of division [GLS2]. It's a little
more complicated than that. The code shown below computes
$q = n \div 2^k$, for $1 \le k \le 31$ [Hop].

    shrsi t,n,k-1       Form the integer
    shri  t,t,32-k      2**k - 1 if n < 0, else 0.
    add   t,n,t         Add it to n,
    shrsi q,t,k         and shift right (signed).

It is branch free. It simplifies to three instructions in the
common case of division by 2 (k = 1). It does, however, rely on
the machine's being able to shift by a large amount in a short
time. The case k = 31 does not make too much sense, because
the number $2^{31}$ is not representable in the machine.
Nevertheless, the code does produce the correct result in that
case (which is q = -1 if $n = -2^{31}$ and q = 0 for all other n).
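
In C, the four-instruction sequence might be sketched as below. The function name div_pow2 is hypothetical, and the code assumes that >> on a negative int32_t is an arithmetic (sign-propagating) shift, which C leaves implementation-defined; the first assert checks that assumption at run time.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating signed division of n by 2**k, 1 <= k <= 31.
   Hypothetical name; mirrors the shrsi/shri/add/shrsi sequence. */
int32_t div_pow2(int32_t n, int k) {
    assert((-1 >> 1) == -1);                /* Arithmetic shift is assumed. */
    assert(k >= 1 && k <= 31);
    uint32_t t = (uint32_t)(n >> (k - 1));  /* shrsi t,n,k-1 */
    t = t >> (32 - k);                      /* shri: 2**k - 1 if n < 0, else 0. */
    return (n + (int32_t)t) >> k;           /* add, then shrsi by k. */
}
```

For instance, div_pow2(-7, 1) gives -3, whereas a bare shift right signed of -7 by one position gives -4.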


To divide by $-2^k$, the above code can be followed by a negate
instruction. There does not seem to be any better way to do it.


The more straightforward code for dividing by $2^k$ is

          bge   n,label       Branch if n >= 0.
          addi  n,n,2**k-1    Add 2**k - 1 to n,
    label shrsi n,n,k         and shift right (signed).


This would be preferable on a machine with slow shifts and fast 
branches. 


PowerPC has an unusual device for speeding up division by a
power of 2 [GGS]. The shift right signed instructions set the
machine's carry bit if the number being shifted is negative and
one or more 1-bits are shifted out. That machine also has an
instruction for adding the carry bit to a register, denoted addze.
This allows division by any (positive) power of 2 to be done in
two instructions:

    shrsi q,n,k
    addze q,q

A single shrsi of k positions does a kind of signed division
by $2^k$ that coincides with both modulus and floor division. This
suggests that one of these might be preferable to truncating
division for computers and HLL's to use. That is, modulus and
floor division mesh with shrsi better than does truncating
division, permitting a compiler to translate the expression n / 2
to an shrsi. Furthermore, shrsi followed by neg (negate) does
modulus division by $-2^k$, which is a hint that maybe modulus
division is best. (This is mainly an aesthetic issue. It is of little
practical significance, because division by a negative constant is
no doubt extremely rare.)


10-2 Signed Remainder from Division by a Known 
Power of 2 


If both the quotient and remainder of $n \div 2^k$ are wanted, it is
simplest to compute the remainder r from $r = n - q \cdot 2^k$. This
requires only two instructions after computing the quotient q:

    shli r,q,k
    sub  r,n,r

To compute only the remainder seems to require about four
or five instructions. One way to compute it is to use the
four-instruction sequence above for signed division by $2^k$,
followed by the two instructions shown immediately above to obtain
the remainder. This results in two consecutive shift instructions
that can be replaced by an and, giving a solution in five
instructions (four if k = 1):

    shrsi t,n,k-1       Form the integer
    shri  t,t,32-k      2**k - 1 if n < 0, else 0.
    add   t,n,t         Add it to n,
    andi  t,t,-2**k     clear rightmost k bits,
    sub   r,n,t         and subtract it from n.


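Another method is based on

$$\mathrm{rem}(n, 2^k) = \begin{cases} n \mathbin{\&} (2^k - 1), & n \ge 0, \\ -((-n) \mathbin{\&} (2^k - 1)), & n < 0. \end{cases}$$

To use this, first compute $t \leftarrow n \overset{s}{\gg} 31$, and then

$$r \leftarrow ((\mathrm{abs}(n) \mathbin{\&} (2^k - 1)) \oplus t) - t$$

(five instructions) or, for k = 1, since $(-n) \mathbin{\&} 1 = n \mathbin{\&} 1$,

$$r \leftarrow ((n \mathbin{\&} 1) \oplus t) - t$$

(four instructions). This method is not very good for k > 1 if the
machine does not have absolute value (computing the remainder
would then require six instructions).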


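Still another method is based on

$$\mathrm{rem}(n, 2^k) = \begin{cases} n \mathbin{\&} (2^k - 1), & n \ge 0, \\ ((n + 2^k - 1) \mathbin{\&} (2^k - 1)) - (2^k - 1), & n < 0. \end{cases}$$

This leads to

$$t \leftarrow (n \overset{s}{\gg} (k - 1)) \overset{u}{\gg} (32 - k)$$

$$r \leftarrow ((n + t) \mathbin{\&} (2^k - 1)) - t$$

(five instructions for k > 1, four for k = 1).

The above methods all work for $1 \le k \le 31$.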
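
As an illustration, the last of these methods might be written in C as follows. The name rem_pow2 is hypothetical; the sketch assumes an arithmetic right shift for negative int32_t values and two's-complement wraparound when the unsigned result is converted back to int32_t, both of which are common but implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder of truncating division of n by 2**k, 1 <= k <= 31.
   Hypothetical name; t is 2**k - 1 if n < 0, else 0. */
int32_t rem_pow2(int32_t n, int k) {
    assert((-1 >> 1) == -1);                /* Arithmetic shift is assumed. */
    assert(k >= 1 && k <= 31);
    uint32_t t    = (uint32_t)(n >> (k - 1)) >> (32 - k);
    uint32_t mask = (1u << k) - 1;          /* 2**k - 1. */
    uint32_t r    = (((uint32_t)n + t) & mask) - t;
    return (int32_t)r;                      /* Wraparound conversion assumed. */
}
```

The result takes the sign of n, as truncating division requires: rem_pow2(-7, 2) is -3 and rem_pow2(7, 2) is 3.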


Incidentally, if shift right signed is not available, the value that
is $2^k - 1$ for n < 0 and 0 for $n \ge 0$ can be constructed from

$$t_1 \leftarrow n \overset{u}{\gg} 31$$

$$t_2 \leftarrow (t_1 \ll k) - t_1,$$

which adds only one instruction.


10-3 Signed Division and Remainder by Non- 
Powers of 2 


The basic trick is to multiply by a sort of reciprocal of the
divisor d, approximately $2^{32}/d$, and then to extract the leftmost
32 bits of the product. The details, however, are more
complicated, particularly for certain divisors such as 7.

Let us first consider a few specific examples. These illustrate
the code that will be generated by the general method. We
denote registers as follows:

    n - the input integer (numerator)
    M - loaded with a "magic number"
    t - a temporary register
    q - will contain the quotient
    r - will contain the remainder


Division by 3 


    li    M,0x55555556   Load magic number, (2**32+2)/3.
    mulhs q,M,n          q = floor(M*n/2**32).
    shri  t,n,31         Add 1 to q if
    add   q,q,t          n is negative.

    muli  t,q,3          Compute remainder from
    sub   r,n,t          r = n - q*3.


Proof. The multiply high signed operation (mulhs) cannot
overflow, as the product of two 32-bit integers can always be
represented in 64 bits and mulhs gives the high-order 32 bits of
the 64-bit product. This is equivalent to dividing the 64-bit
product by $2^{32}$ and taking the floor of the result, and this is
true whether the product is positive or negative. Thus, for
$n \ge 0$ the above code computes

$$q = \left\lfloor \frac{2^{32} + 2}{3} \cdot \frac{n}{2^{32}} \right\rfloor = \left\lfloor \frac{n}{3} + \frac{2n}{3 \cdot 2^{32}} \right\rfloor.$$

Now, $n < 2^{31}$, because $2^{31} - 1$ is the largest representable
positive number. Hence, the "error" term $2n / (3 \cdot 2^{32})$ is less
than 1/3 (and is nonnegative), so by Theorem D4 (page 183) we
have $q = \lfloor n/3 \rfloor$, which is the desired result (Equation
(1) on page 182).

For n < 0, there is an addition of 1 to the quotient. Hence
the code computes

$$q = \left\lfloor \frac{2^{32} + 2}{3} \cdot \frac{n}{2^{32}} \right\rfloor + 1 = \left\lfloor \frac{2^{32} n + 2n + 3 \cdot 2^{32}}{3 \cdot 2^{32}} \right\rfloor = \left\lceil \frac{2^{32} n + 2n + 1}{3 \cdot 2^{32}} \right\rceil,$$

where we have used Theorem D2. Hence

$$q = \left\lceil \frac{n}{3} + \frac{2n + 1}{3 \cdot 2^{32}} \right\rceil.$$

For $-2^{31} \le n \le -1$,

$$-\frac{1}{3} + \frac{1}{3 \cdot 2^{32}} \le \frac{2n + 1}{3 \cdot 2^{32}} \le -\frac{1}{3 \cdot 2^{32}}.$$

The error term is nonpositive and greater than -1/3, so by
Theorem D4 $q = \lceil n/3 \rceil$, which is the desired result
(Equation (1) on page 182).

This establishes that the quotient is correct. That the
remainder is correct follows easily from the fact that the
remainder must satisfy

$$n = qd + r,$$

the multiplication by 3 cannot overflow (because
$-2^{31}/3 \le q \le (2^{31} - 1)/3$), and the subtract cannot overflow
because the result must be in the range -2 to +2.


The multiply immediate can be done with two add's, or a shift 
and an add, if either gives an improvement in execution time. 


On many present-day RISC computers, the quotient can be 
computed as shown above in nine or ten cycles, whereas the 
divide instruction might take 20 cycles or so. 
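
The divide-by-3 sequence can be tried out in C along these lines. The names mulhs32 and div3 are hypothetical; multiply high signed is modeled with a 64-bit product and an arithmetic right shift, which the sketch assumes the compiler provides for negative values (checked by the first assert).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* High-order 32 bits of the 64-bit signed product (models mulhs). */
static int32_t mulhs32(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Truncating n / 3 by magic-number multiplication; *r gets n - 3q. */
int32_t div3(int32_t n, int32_t *r) {
    assert((-1 >> 1) == -1);               /* Arithmetic shift is assumed. */
    assert(r != NULL);
    int32_t q = mulhs32(0x55555556, n);    /* q = floor(M*n/2**32). */
    q += (int32_t)((uint32_t)n >> 31);     /* Add 1 to q if n is negative. */
    *r = n - q * 3;                        /* Remainder, in -2..+2. */
    return q;
}
```

For n = -7 the multiply high gives -3, the sign correction makes it -2, and the remainder is -1.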


Division by 5 


For division by 5, we would like to use the same code as for
division by 3, except with a multiplier of $(2^{32} + 4)/5$.
Unfortunately, the error term is then too large; the result is off
by 1 for about 1/5 of the values of $n \ge 2^{30}$ in magnitude.
However, we can use a multiplier of $(2^{33} + 3)/5$ and add a
shift right signed instruction. The code is

    li    M,0x66666667   Load magic number, (2**33+3)/5.
    mulhs q,M,n          q = floor(M*n/2**32).
    shrsi q,q,1
    shri  t,n,31         Add 1 to q if
    add   q,q,t          n is negative.

    muli  t,q,5          Compute remainder from
    sub   r,n,t          r = n - q*5.


Proof. The mulhs produces the leftmost 32 bits of the 64-bit
product, and then the code shifts this right by one position,
signed (or "arithmetically"). This is equivalent to dividing the
product by $2^{33}$ and then taking the floor of the result. Thus, for
$n \ge 0$ the code computes

$$q = \left\lfloor \frac{2^{33} + 3}{5} \cdot \frac{n}{2^{33}} \right\rfloor = \left\lfloor \frac{n}{5} + \frac{3n}{5 \cdot 2^{33}} \right\rfloor.$$

For $0 \le n < 2^{31}$, the error term $3n / (5 \cdot 2^{33})$ is
nonnegative and less than 1/5, so by Theorem D4,
$q = \lfloor n/5 \rfloor$.

For n < 0, the above code computes

$$q = \left\lfloor \frac{2^{33} + 3}{5} \cdot \frac{n}{2^{33}} \right\rfloor + 1 = \left\lceil \frac{n}{5} + \frac{3n + 1}{5 \cdot 2^{33}} \right\rceil.$$

The error term is nonpositive and greater than -1/5, so
$q = \lceil n/5 \rceil$.

That the remainder is correct follows as in the case of division
by 3.

The multiply immediate can be done with a shift left of two and an
add.
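
A C rendering of the divide-by-5 quotient computation might look like this. The name div5 is hypothetical, and as before the sketch assumes an arithmetic right shift on negative signed values.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating n / 5 using the magic number (2**33 + 3)/5. */
int32_t div5(int32_t n) {
    assert((-1 >> 1) == -1);                  /* Arithmetic shift is assumed. */
    int32_t q = (int32_t)(((int64_t)0x66666667 * n) >> 32);   /* mulhs */
    q >>= 1;                                  /* shrsi q,q,1 */
    q += (int32_t)((uint32_t)n >> 31);        /* Add 1 to q if n is negative. */
    return q;
}
```

The extra shift is what distinguishes this from the divide-by-3 code: the total right shift of the product is 33 positions instead of 32.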


Division by 7 


Dividing by 7 creates a new problem. Multipliers of $(2^{32} + 3)/7$
and $(2^{33} + 6)/7$ give error terms that are too large. A
multiplier of $(2^{34} + 5)/7$ would work, but it's too large to
represent in a 32-bit signed word. We can multiply by this large
number by multiplying by $(2^{34} + 5)/7 - 2^{32}$ (a negative
number), and then correcting the product by inserting an add.
The code is

    li    M,0x92492493   Magic num, (2**34+5)/7 - 2**32.
    mulhs q,M,n          q = floor(M*n/2**32).
    add   q,q,n          q = floor(M*n/2**32) + n.
    shrsi q,q,2          q = floor(q/4).
    shri  t,n,31         Add 1 to q if
    add   q,q,t          n is negative.

    muli  t,q,7          Compute remainder from
    sub   r,n,t          r = n - q*7.


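Proof. It is important to note that the instruction "add q,q,n"
above cannot overflow. This is because q and n have opposite
signs, due to the multiplication by a negative number. Therefore,
this "computer arithmetic" addition is the same as real number
addition. Hence for $n \ge 0$ the above code computes

$$q = \left\lfloor \left( \left\lfloor \left( \frac{2^{34} + 5}{7} - 2^{32} \right) \frac{n}{2^{32}} \right\rfloor + n \right) \Big/ 4 \right\rfloor = \left\lfloor \frac{n}{7} + \frac{5n}{7 \cdot 2^{34}} \right\rfloor,$$

where we have used the corollary of Theorem D3.

For $0 \le n < 2^{31}$, the error term $5n / (7 \cdot 2^{34})$ is
nonnegative and less than 1/7, so $q = \lfloor n/7 \rfloor$.

For n < 0, the above code computes

$$q = \left\lfloor \left( \left\lfloor \left( \frac{2^{34} + 5}{7} - 2^{32} \right) \frac{n}{2^{32}} \right\rfloor + n \right) \Big/ 4 \right\rfloor + 1 = \left\lceil \frac{2^{34} n + 5n + 1}{7 \cdot 2^{34}} \right\rceil = \left\lceil \frac{n}{7} + \frac{5n + 1}{7 \cdot 2^{34}} \right\rceil.$$

The error term is nonpositive and greater than -1/7, so
$q = \lceil n/7 \rceil$.

The multiply immediate can be done with a shift left of three
and a subtract.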
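
The divide-by-7 sequence, with its corrective add, might be sketched in C as follows. The name div7 is hypothetical; the magic number is written as the negative decimal value of the word 0x92492493 to avoid relying on an out-of-range constant conversion, and an arithmetic right shift on negative values is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating n / 7 using the magic number (2**34 + 5)/7 - 2**32. */
int32_t div7(int32_t n) {
    assert((-1 >> 1) == -1);                  /* Arithmetic shift is assumed. */
    const int32_t M = -1840700269;            /* 0x92492493 as a signed word. */
    int32_t q = (int32_t)(((int64_t)M * n) >> 32);   /* mulhs */
    q += n;                                   /* Corrective add; q and n have
                                                 opposite signs, no overflow. */
    q >>= 2;                                  /* q = floor(q/4). */
    q += (int32_t)((uint32_t)n >> 31);        /* Add 1 to q if n is negative. */
    return q;
}
```

For n = 7 the multiply high gives -3, the add gives 4, and the shift by two gives the quotient 1.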


10-4 Signed Division by Divisors ≥ 2


At this point you may wonder if other divisors present other
problems. We see in this section that they do not; the three
examples given illustrate the only cases that arise (for $d \ge 2$).


Some of the proofs are a bit complicated, so to be cautious,
the work is done in terms of a general word size W.

Given a word size $W \ge 3$ and a divisor d, $2 \le d < 2^{W-1}$, we
wish to find the least integer m and integer p such that

$$\left\lfloor \frac{mn}{2^p} \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor \quad \text{for } 0 \le n < 2^{W-1}, \text{ and} \qquad (1a)$$

$$\left\lfloor \frac{mn}{2^p} \right\rfloor + 1 = \left\lceil \frac{n}{d} \right\rceil \quad \text{for } -2^{W-1} \le n \le -1, \qquad (1b)$$

with $0 \le m < 2^W$ and $p \ge W$.


The reason we want the least integer m is that a smaller
multiplier may give a smaller shift amount (possibly zero) or
may yield code similar to the "divide by 5" example, rather than
the "divide by 7" example. We must have $m \le 2^W - 1$ so the
code has no more instructions than that of the "divide by 7"
example (that is, we can handle a multiplier in the range
$2^{W-1}$ to $2^W - 1$ by means of the add that was inserted in the
"divide by 7" example, but we would rather not deal with larger
multipliers). We must have $p \ge W$, because the generated code
extracts the left half of the product mn, which is equivalent to
shifting right W positions. Thus, the total right shift is W or
more positions.

There is a distinction between the multiplier m and the
"magic number," denoted M. The magic number is the value
used in the multiply instruction. It is given by

$$M = \begin{cases} m, & \text{if } 0 \le m < 2^{W-1}, \\ m - 2^W, & \text{if } 2^{W-1} \le m < 2^W. \end{cases}$$


Because (1b) must hold for n = -d, $\lfloor -md/2^p \rfloor + 1 = -1$,
which implies

$$\frac{md}{2^p} > 1. \tag{2}$$

Let $n_c$ be the largest (positive) value of n such that
$\mathrm{rem}(n_c, d) = d - 1$. $n_c$ exists because one possibility is
$n_c = d - 1$. It can be calculated from
$n_c = \lfloor 2^{W-1}/d \rfloor d - 1 = 2^{W-1} - \mathrm{rem}(2^{W-1}, d) - 1$.
$n_c$ is one of the highest d admissible values of n, so

$$2^{W-1} - d \le n_c \le 2^{W-1} - 1, \tag{3a}$$

and, clearly

$$n_c \ge d - 1. \tag{3b}$$


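Because (1a) must hold for $n = n_c$,

$$\left\lfloor \frac{m n_c}{2^p} \right\rfloor = \left\lfloor \frac{n_c}{d} \right\rfloor = \frac{n_c - (d - 1)}{d},$$

or

$$\frac{m n_c}{2^p} < \frac{n_c + 1}{d}.$$

Combining this with (2) gives

$$\frac{2^p}{d} < m < \frac{2^p}{d} \cdot \frac{n_c + 1}{n_c}. \tag{4}$$

Because m is to be the least integer satisfying (4), it is the
next integer greater than $2^p/d$; that is,

$$m = \frac{2^p + d - \mathrm{rem}(2^p, d)}{d}. \tag{5}$$

Combining this with the right half of (4) and simplifying gives

$$2^p > n_c(d - \mathrm{rem}(2^p, d)). \tag{6}$$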


The Algorithm 


Thus, the algorithm to find the magic number M and the shift
amount s from d is to first compute $n_c$, and then solve (6) for p
by trying successively larger values. If p < W, set p = W (the
theorem below shows that this value of p also satisfies (6)).
When the smallest $p \ge W$ satisfying (6) is found, m is calculated
from (5). This is the smallest possible value of m, because we
found the smallest acceptable p, and from (4) clearly smaller
values of p yield smaller values of m. Finally, s = p - W and M is
simply a reinterpretation of m as a signed integer (which is how
the mulhs instruction interprets it).
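The search just described can be written directly for W = 32 if 64-bit unsigned arithmetic is assumed, since the intermediate values stay below $2^{63}$ for p up to 62. The sketch below is illustrative only; the names find_multiplier, magic_from_m, and struct mp are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Result of the search: multiplier m (0 <= m < 2**32) and exponent p. */
struct mp { uint64_t m; int p; };

/* Find the least p >= 32 satisfying (6), then m from (5), for W = 32.
   Requires 2 <= d <= 2**31 - 1. */
struct mp find_multiplier(uint32_t d) {
    assert(d >= 2 && d <= 0x7FFFFFFFu);
    const uint64_t two31 = (uint64_t)1 << 31;
    uint64_t nc = two31 - two31 % d - 1;        /* n_c from its definition */
    struct mp r;
    for (r.p = 32; ; r.p++) {
        uint64_t pow2 = (uint64_t)1 << r.p;     /* 2**p */
        uint64_t rem = pow2 % d;
        if (pow2 > nc * (d - rem)) {            /* inequality (6) */
            r.m = (pow2 + d - rem) / d;         /* equation (5)   */
            return r;
        }
    }
}

/* Magic number M: m reinterpreted as a signed 32-bit value. */
int64_t magic_from_m(uint64_t m) {
    return m < ((uint64_t)1 << 31) ? (int64_t)m
                                   : (int64_t)m - ((int64_t)1 << 32);
}
```

With this sketch, d = 7 gives m = 0x92492493 and p = 34, so s = p - 32 = 2, in agreement with the "divide by 7" example.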


Forcing p to be at least W is justified by the following:

THEOREM DC1. If (6) is true for some value of p, then it is true for
all larger values of p.

Proof. Suppose (6) is true for $p = p_0$. Multiplying (6) by 2
gives

$$2^{p_0+1} > n_c(2d - 2\,\mathrm{rem}(2^{p_0}, d)).$$

From Theorem D5, $\mathrm{rem}(2^{p_0+1}, d) \ge 2\,\mathrm{rem}(2^{p_0}, d) - d$.
Combining gives

$$2^{p_0+1} > n_c(2d - (\mathrm{rem}(2^{p_0+1}, d) + d)), \quad \text{or}$$

$$2^{p_0+1} > n_c(d - \mathrm{rem}(2^{p_0+1}, d)).$$

Therefore, (6) is true for $p = p_0 + 1$, and hence for all larger
values.

Thus, one could solve (6) by a binary search, although a 
simple linear search (starting with p = W) is probably 
preferable, because usually d is small, and small values of d give 
small values of p. 


Proof That the Algorithm Is Feasible 


We must show that (6) always has a solution and that
$0 \le m < 2^W$. (It is not necessary to show that $p \ge W$, because
that is forced.)


We show that (6) always has a solution by getting an upper
bound on p. As a matter of general interest, we also derive a
lower bound under the assumption that p is not forced to be at
least W. To get these bounds on p, observe that for any positive
integer x, there is a power of 2 greater than x and less than or
equal to 2x. Hence, from (6),

$$n_c(d - \mathrm{rem}(2^p, d)) < 2^p \le 2n_c(d - \mathrm{rem}(2^p, d)).$$

Because $0 \le \mathrm{rem}(2^p, d) \le d - 1$,

$$n_c + 1 \le 2^p \le 2n_c d. \tag{7}$$

From (3a) and (3b), $n_c \ge \max(2^{W-1} - d, d - 1)$. The lines
$f_1(d) = 2^{W-1} - d$ and $f_2(d) = d - 1$ cross at
$d = (2^{W-1} + 1)/2$. Hence $n_c \ge (2^{W-1} - 1)/2$. Because $n_c$ is
an integer, $n_c \ge 2^{W-2}$. Because $n_c, d \le 2^{W-1} - 1$, (7)
becomes

$$2^{W-2} + 1 \le 2^p \le 2(2^{W-1} - 1)^2$$

or

$$W - 1 \le p \le 2W - 2. \tag{8}$$

The lower bound p = W - 1 can occur (e.g., for W = 32, d =
3), but in that case we set p = W.


If p is not forced to equal W, then from (4) and (7),

$$\frac{n_c + 1}{d} < m < \frac{2 n_c d}{d} \cdot \frac{n_c + 1}{n_c}.$$

Using (3b) gives

$$\frac{d - 1 + 1}{d} < m < 2(n_c + 1).$$

Because $n_c \le 2^{W-1} - 1$ (3a),

$$2 \le m \le 2^W - 1.$$

If p is forced to equal W, then from (4),

$$\frac{2^W}{d} < m < \frac{2^W}{d} \cdot \frac{n_c + 1}{n_c}.$$

Because $2 \le d \le 2^{W-1} - 1$ and $n_c \ge 2^{W-2}$,

$$\frac{2^W}{2^{W-1} - 1} < m < \frac{2^W}{2} \cdot \frac{2^{W-2} + 1}{2^{W-2}}, \quad \text{or}$$

$$3 \le m \le 2^{W-1} + 1.$$

Hence in either case m is within limits for the code schema
illustrated by the "divide by 7" example.


Proof That the Product Is Correct 


We must show that if p and m are calculated from (6) and (5), 
then Equations (1a) and (1b) are satisfied. 


Equation (5) and inequality (6) are easily seen to imply (4).
(In the case that p is forced to be equal to W, (6) still holds, as
shown by Theorem DC1.) In what follows, we consider
separately the following five ranges of values of n:

$0 \le n \le n_c$,
$n_c + 1 \le n \le n_c + d - 1$,
$-n_c \le n \le -1$,
$-n_c - d + 1 \le n \le -n_c - 1$, and
$n = -n_c - d$.
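From (4), because m is an integer,

$$\frac{2^p}{d} < m \le \frac{2^p(n_c + 1) - 1}{d n_c}.$$

Multiplying by $n/2^p$, for $n \ge 0$ this becomes

$$\frac{n}{d} \le \frac{mn}{2^p} \le \frac{2^p n(n_c + 1) - n}{2^p d n_c},$$

so that

$$\left\lfloor \frac{n}{d} \right\rfloor \le \left\lfloor \frac{mn}{2^p} \right\rfloor \le \left\lfloor \frac{n}{d} + \frac{(2^p - 1)n}{2^p d n_c} \right\rfloor.$$

For $0 \le n \le n_c$, $0 \le (2^p - 1)n/(2^p d n_c) < 1/d$, so by
Theorem D4,

$$\left\lfloor \frac{n}{d} + \frac{(2^p - 1)n}{2^p d n_c} \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor.$$

Hence (1a) is satisfied in this case ($0 \le n \le n_c$).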


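For $n > n_c$, n is limited to the range

$$n_c + 1 \le n \le n_c + d - 1, \tag{9}$$

because $n \ge n_c + d$ contradicts the choice of $n_c$ as the largest
value of n such that $\mathrm{rem}(n_c, d) = d - 1$ (alternatively, from
(3a), $n \ge n_c + d$ implies $n \ge 2^{W-1}$). From (4), for $n \ge 0$,

$$\frac{n}{d} < \frac{mn}{2^p} < \frac{n}{d} \cdot \frac{n_c + 1}{n_c}.$$

By elementary algebra, this can be written

$$\frac{n}{d} < \frac{mn}{2^p} < \frac{n_c + 1}{d} + \frac{(n - n_c)(n_c + 1)}{d n_c}. \tag{10}$$

From (9), $1 \le n - n_c \le d - 1$, so

$$0 < \frac{(n - n_c)(n_c + 1)}{d n_c} \le \frac{d - 1}{d} \cdot \frac{n_c + 1}{n_c}.$$

Because $n_c \ge d - 1$ (by (3b)) and $(n_c + 1)/n_c$ has its maximum
when $n_c$ has its minimum,

$$0 < \frac{(n - n_c)(n_c + 1)}{d n_c} \le \frac{d - 1}{d} \cdot \frac{d - 1 + 1}{d - 1} = 1.$$

In (10), the term $(n_c + 1)/d$ is an integer. The term
$(n - n_c)(n_c + 1)/(d n_c)$ is less than or equal to 1. Therefore,
(10) becomes

$$\left\lfloor \frac{n}{d} \right\rfloor \le \left\lfloor \frac{mn}{2^p} \right\rfloor \le \frac{n_c + 1}{d}.$$

For all n in the range (9), $\lfloor n/d \rfloor = (n_c + 1)/d$. Hence, (1a)
is satisfied in this case ($n_c + 1 \le n \le n_c + d - 1$).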


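For n < 0, from (4) we have, because m is an integer,

$$\frac{2^p + 1}{d} \le m < \frac{2^p}{d} \cdot \frac{n_c + 1}{n_c}.$$

Multiplying by $n/2^p$, for n < 0 this becomes

$$\frac{n}{d} \cdot \frac{n_c + 1}{n_c} < \frac{mn}{2^p} \le \frac{n}{d} \cdot \frac{2^p + 1}{2^p},$$

or

$$\left\lfloor \frac{n}{d} \cdot \frac{n_c + 1}{n_c} \right\rfloor + 1 \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lfloor \frac{n}{d} \cdot \frac{2^p + 1}{2^p} \right\rfloor + 1.$$

Using Theorem D2 gives

$$\left\lceil \frac{n(n_c + 1) - d n_c + 1}{d n_c} \right\rceil + 1 \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n(2^p + 1) - 2^p d + 1}{2^p d} \right\rceil + 1,$$

$$\left\lceil \frac{n(n_c + 1) + 1}{d n_c} \right\rceil \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n(2^p + 1) + 1}{2^p d} \right\rceil.$$

Because $n + 1 \le 0$, the right inequality can be weakened,
giving

$$\left\lceil \frac{n}{d} + \frac{n + 1}{d n_c} \right\rceil \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n}{d} \right\rceil. \tag{11}$$

For $-n_c \le n \le -1$,

$$\frac{-n_c + 1}{d n_c} \le \frac{n + 1}{d n_c} \le 0, \quad \text{or} \quad -\frac{1}{d} < \frac{n + 1}{d n_c} \le 0.$$

Hence, by Theorem D4,

$$\left\lceil \frac{n}{d} + \frac{n + 1}{d n_c} \right\rceil = \left\lceil \frac{n}{d} \right\rceil,$$

so that (1b) is satisfied in this case ($-n_c \le n \le -1$).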
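For $n < -n_c$, n is limited to the range

$$-n_c - d \le n \le -n_c - 1. \tag{12}$$

(From (3a), $n < -n_c - d$ implies that $n < -2^{W-1}$, which is
impossible.) Performing elementary algebraic manipulation of
the left comparand of (11) gives

$$\left\lceil \frac{-n_c - 1}{d} + \frac{(n + n_c)(n_c + 1) + 1}{d n_c} \right\rceil \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n}{d} \right\rceil. \tag{13}$$

For $-n_c - d + 1 \le n \le -n_c - 1$,

$$\frac{(-d + 1)(n_c + 1)}{d n_c} + \frac{1}{d n_c} \le \frac{(n + n_c)(n_c + 1) + 1}{d n_c} \le \frac{-(n_c + 1) + 1}{d n_c} = -\frac{1}{d}.$$

The ratio $(n_c + 1)/n_c$ is a maximum when $n_c$ is a minimum;
that is, $n_c = d - 1$.

Therefore,

$$\frac{(-d + 1)(d - 1 + 1)}{d(d - 1)} + \frac{1}{d n_c} \le \frac{(n + n_c)(n_c + 1) + 1}{d n_c} < 0, \quad \text{or}$$

$$-1 < \frac{(n + n_c)(n_c + 1) + 1}{d n_c} < 0.$$

From (13), because $(-n_c - 1)/d$ is an integer and the quantity
added to it is between 0 and -1,

$$\frac{-n_c - 1}{d} \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n}{d} \right\rceil.$$

For n in the range $-n_c - d + 1 \le n \le -n_c - 1$,

$$\left\lceil \frac{n}{d} \right\rceil = \frac{-n_c - 1}{d}.$$

Hence, $\lfloor mn/2^p \rfloor + 1 = \lceil n/d \rceil$; that is, (1b) is satisfied.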


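The last case, $n = -n_c - d$, can occur only for certain values
of d. From (3a), $-n_c - d \le -2^{W-1}$, so if n takes on this value,
we must have $n = -n_c - d = -2^{W-1}$, and hence
$n_c = 2^{W-1} - d$.

Therefore, $\mathrm{rem}(2^{W-1}, d) = \mathrm{rem}(n_c + d, d) = d - 1$ (that is,
d divides $2^{W-1} + 1$).

For this case ($n = -n_c - d$), (6) has the solution p = W - 1
(the smallest possible value of p), because for p = W - 1,

$$n_c(d - \mathrm{rem}(2^p, d)) = (2^{W-1} - d)(d - \mathrm{rem}(2^{W-1}, d)) = (2^{W-1} - d)(d - (d - 1)) = 2^{W-1} - d < 2^{W-1} = 2^p.$$

Then from (5),

$$m = \frac{2^{W-1} + d - \mathrm{rem}(2^{W-1}, d)}{d} = \frac{2^{W-1} + d - (d - 1)}{d} = \frac{2^{W-1} + 1}{d}.$$

Therefore,

$$\left\lfloor \frac{mn}{2^p} \right\rfloor + 1 = \left\lfloor \frac{2^{W-1} + 1}{d} \cdot \frac{-2^{W-1}}{2^{W-1}} \right\rfloor + 1 = \left\lfloor \frac{-2^{W-1} - 1}{d} \right\rfloor + 1 = \left\lceil \frac{-2^{W-1} - d}{d} \right\rceil + 1 = \left\lceil \frac{-2^{W-1}}{d} \right\rceil = \left\lceil \frac{n}{d} \right\rceil,$$

so that (1b) is satisfied.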


This completes the proof that if m and p are calculated from 
(5) and (6), then Equations (1a) and (1b) hold for all admissible 
values of n. 


10-5 Signed Division by Divisors ≤ -2


Because signed integer division satisfies n ÷ (-d) = -(n ÷ d), it
is adequate to generate code for n ÷ |d| and follow it with an
instruction to negate the quotient. (This does not give the
correct result for $d = -2^{W-1}$, but for this and other negative
powers of 2, you can use the code in Section 10-1, "Signed
Division by a Known Power of 2," on page 205, followed by a
negating instruction.) It will not do to negate the dividend,
because of the possibility that it is the maximum negative
number.


It is possible to avoid the negating instruction. The scheme is
to compute

$$q = \left\lfloor \frac{mn}{2^p} \right\rfloor \quad \text{if } n \le 0, \text{ and}$$

$$q = \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \quad \text{if } n > 0.$$

Adding 1 if n > 0 is awkward (because one cannot simply use
the sign bit of n), so the code will instead add 1 if q < 0. This is
equivalent, because the multiplier m is negative (as will be
seen).


The code to be generated is illustrated below for the case W 
= 32, d = -7. 


li     M,0x6DB6DB6D   Magic num, -(2**34+5)/7 + 2**32.
mulhs  q,M,n          q = floor(M*n/2**32).
sub    q,q,n          q = floor(M*n/2**32) - n.
shrsi  q,q,2          q = floor(q/4).
shri   t,q,31         Add 1 to q if
add    q,q,t          q is negative (n is positive).

muli   t,q,-7         Compute remainder from
sub    r,n,t          r = n - q*(-7).


This code is the same as that for division by +7, except that
it uses the negative of the multiplier for +7, and a sub rather
than an add after the multiply, and the shri of 31 must use q
rather than n, as discussed above. (The case of d = +7 could
also use q here, but there would be less parallelism in the code.)
The subtract will not overflow, because the operands have the
same sign. This scheme, however, does not always work!


Although the code above for W = 32, d = -7 is correct, the
analogous alteration of the "divide by 3" code to produce code
to divide by -3 does not give the correct result for W = 32,
$n = -2^{31}$.
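Both claims can be checked with a small C sketch. It assumes 64-bit arithmetic to emulate mulhs; the function names div_neg7 and div_neg3_naive are hypothetical, and the second one deliberately reproduces the incorrect alteration of the "divide by 3" code (negated multiplier $(2^{32}+2)/3$, sign test on q).

```c
#include <assert.h>
#include <stdint.h>

/* floor(a*b / 2**32), the result of mulhs. */
static int32_t mulhs(int32_t a, int32_t b) {
    int64_t p = (int64_t)a * b;
    int64_t q = p / 4294967296LL;
    if (p % 4294967296LL < 0) q--;
    return (int32_t)q;
}

/* floor(x / 2**k), the result of shrsi. */
static int32_t shrs(int32_t x, int k) {
    assert(k >= 0 && k <= 30);
    int32_t d = (int32_t)1 << k;
    int32_t q = x / d;
    if (x % d < 0) q--;
    return q;
}

/* Division by -7, following the listing above. */
int32_t div_neg7(int32_t n) {
    const int32_t M = 0x6DB6DB6D;           /* -(2**34+5)/7 + 2**32      */
    int32_t q = mulhs(M, n);
    q = q - n;                              /* operands have same sign   */
    q = shrs(q, 2);
    q = q + (int32_t)((uint32_t)q >> 31);   /* add 1 if q is negative    */
    return q;
}

/* The analogous alteration of the divide-by-3 code. It is wrong for
   n = -2**31, which is the point of the example. */
int32_t div_neg3_naive(int32_t n) {
    const int32_t M = -0x55555556;          /* -(2**32+2)/3              */
    int32_t q = mulhs(M, n);
    q = q + (int32_t)((uint32_t)q >> 31);
    return q;
}
```

Here div_neg3_naive(INT32_MIN) yields 715827883, whereas the truncated quotient of $-2^{31}$ by -3 is 715827882.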

Let us look at the situation more closely. 


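Given a word size $W \ge 3$ and a divisor d, $-2^{W-1} \le d \le -2$,
we wish to find the least (in absolute value) integer m and
integer p such that

$$\left\lfloor \frac{mn}{2^p} \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor \quad \text{for } -2^{W-1} \le n \le 0, \text{ and} \tag{14a}$$

$$\left\lfloor \frac{mn}{2^p} \right\rfloor + 1 = \left\lceil \frac{n}{d} \right\rceil \quad \text{for } 1 \le n < 2^{W-1}, \tag{14b}$$

with $-2^W \le m \le 0$ and $p \ge W$.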


Proceeding similarly to the case of division by a positive
divisor, let $n_c$ be the most negative value of n such that
$n_c = kd + 1$ for some integer k. $n_c$ exists, because one possibility
is $n_c = d + 1$. It can be calculated from
$n_c = \lfloor (-2^{W-1} - 1)/d \rfloor d + 1 = -2^{W-1} + \mathrm{rem}(2^{W-1} + 1, d)$.
$n_c$ is one of the least |d| admissible values of n, so

$$-2^{W-1} \le n_c \le -2^{W-1} - d - 1, \tag{15a}$$

and, clearly

$$n_c \le d + 1. \tag{15b}$$

Because (14b) must hold for n = -d, and (14a) must hold for
$n = n_c$, we obtain, analogous to (4),

$$\frac{2^p}{d} \cdot \frac{n_c - 1}{n_c} < m < \frac{2^p}{d}. \tag{16}$$


Because m is to be the greatest integer satisfying (16), it is the
next integer less than $2^p/d$; that is,

$$m = \frac{2^p - d - \mathrm{rem}(2^p, d)}{d}. \tag{17}$$

Combining this with the left half of (16) and simplifying gives

$$2^p > n_c(d + \mathrm{rem}(2^p, d)). \tag{18}$$

The proof that the algorithm suggested by (17) and (18) is
feasible, and that the product is correct, is similar to that for a
positive divisor, and will not be repeated. A difficulty arises in
trying to prove that $-2^W \le m \le 0$. To prove this, consider
separately the cases in which d is the negative of a power of 2,
or some other number. For $d = -2^k$, it is easy to show that
$n_c = -2^{W-1} + 1$, $p = W + k - 1$, and $m = -2^{W-1} - 1$ (which is
within range). For d not of the form $-2^k$, it is straightforward to
alter the earlier proof.


For Which Divisors Is m(-d) ≠ -m(d)?


By m(d) we mean the multiplier corresponding to a divisor d. If
m(-d) = -m(d), code for division by a negative divisor can be
generated by calculating the multiplier for |d|, negating it, and
then generating code similar to that of the "divide by -7" case
illustrated above.


By comparing (18) with (6) and (17) with (5), it can be seen
that if the value of $n_c$ for -d is the negative of that for d, then
m(-d) = -m(d). Hence, m(-d) ≠ -m(d) can occur only when the
value of $n_c$ calculated for the negative divisor is the maximum
negative number, $-2^{W-1}$. Such divisors are the negatives of the
factors of $2^{W-1} + 1$. These numbers are fairly rare, as
illustrated by the factorings below (obtained from Scratchpad).

$2^{15} + 1 = 3^2 \cdot 11 \cdot 331$
$2^{31} + 1 = 3 \cdot 715{,}827{,}883$
$2^{63} + 1 = 3^3 \cdot 19 \cdot 43 \cdot 5419 \cdot 77{,}158{,}673{,}929$

For all these factors, m(-d) ≠ -m(d). Proof sketch: For d > 0 we
have $n_c = 2^{W-1} - d$. Because $\mathrm{rem}(2^{W-1}, d) = d - 1$, (6) is
satisfied by p = W - 1 and hence also by p = W. For d < 0,
however, we have $n_c = -2^{W-1}$ and $\mathrm{rem}(2^{W-1}, d) = |d| - 1$.
Hence, (18) is not satisfied for p = W - 1 or for p = W, so
p > W.


10-6 Incorporation into a Compiler 


For a compiler to change division by a constant into a
multiplication, it must compute the magic number M and the
shift amount s, given a divisor d. The straightforward
computation is to evaluate (6) or (18) for p = W, W + 1, ... until
it is satisfied. Then, m is calculated from (5) or (17). M is simply
a reinterpretation of m as a signed integer, and s = p - W.

The scheme described below handles positive and negative d 
with only a little extra code, and it avoids doubleword 
arithmetic. 


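Recall that $n_c$ is given by

$$n_c = \begin{cases} 2^{W-1} - \mathrm{rem}(2^{W-1}, d) - 1, & \text{if } d > 0, \\ -2^{W-1} + \mathrm{rem}(2^{W-1} + 1, d), & \text{if } d < 0. \end{cases}$$

Hence, $|n_c|$ can be computed from

$$t = 2^{W-1} + \begin{cases} 0, & \text{if } d > 0, \\ 1, & \text{if } d < 0; \end{cases}$$

$$|n_c| = t - 1 - \mathrm{rem}(t, |d|).$$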


The remainder must be evaluated using unsigned division, 
because of the magnitude of the arguments. We have written 
rem(t, |d|) rather than the equivalent rem(t, d), to emphasize 
that the program must deal with two positive (and unsigned) 
arguments. 


From (6) and (18), p can be calculated from

$$2^p > |n_c|(|d| - \mathrm{rem}(2^p, |d|)), \tag{19}$$

and then |m| can be calculated from (c.f. (5) and (17)):

$$|m| = \frac{2^p + |d| - \mathrm{rem}(2^p, |d|)}{|d|}. \tag{20}$$

Direct evaluation of $\mathrm{rem}(2^p, |d|)$ in (19) requires "long
division" (dividing a 2W-bit dividend by a W-bit divisor, giving
a W-bit quotient and remainder), and, in fact, it must be unsigned
long division. There is a way to solve (19), and to do all the
calculations, that avoids long division and can easily be
implemented in a conventional HLL using only W-bit arithmetic.
We do, however, need unsigned division and unsigned
comparisons.

We can calculate $\mathrm{rem}(2^p, |d|)$ incrementally, by initializing
two variables q and r to the quotient and remainder of $2^p$
divided by |d| with p = W - 1, and then updating q and r as p
increases.

As the search progresses (that is, when p is incremented by 1),
q and r are updated from (see Theorem D5(a))


q = 2*q;
r = 2*r;
if (r >= abs(d)) {
   q = q + 1;
   r = r - abs(d);}


The left half of inequality (4) and the right half of (16),
together with the bounds proved for m, imply that
$q = \lfloor 2^p/|d| \rfloor < 2^W$, so q is representable as a W-bit
unsigned integer. Also, $0 \le r < |d|$, so r is representable as a
W-bit signed or unsigned integer. (Caution: The intermediate
result 2r can exceed $2^{W-1} - 1$, so r should be unsigned and the
comparison above should also be unsigned.)


Next, calculate $\delta = |d| - r$. Both terms of the subtraction are
representable as W-bit unsigned integers, and the result is also
($1 \le \delta \le |d|$), so there is no difficulty here.


To avoid the long multiplication of (19), rewrite it as

$$\frac{2^p}{|n_c|} > \delta.$$

The quantity $2^p/|n_c|$ is representable as a W-bit unsigned
integer (similar to (7), from (19) it can be shown that
$2^p \le 2|n_c| \cdot |d|$ and, for $d = -2^{W-1}$, $n_c = -2^{W-1} + 1$ and
$p = 2W - 2$, so that $2^p/|n_c| = 2^{2W-2}/(2^{W-1} - 1) < 2^W$ for
$W \ge 3$). Also, it is easily calculated incrementally (as p
increases) in the same manner as for $\mathrm{rem}(2^p, |d|)$. The
comparison should be unsigned, for the case
$2^p/|n_c| \ge 2^{W-1}$ (which can occur, for large d).


To compute m, we need not evaluate (20) directly (which
would require long division). Observe that

$$\frac{2^p + |d| - \mathrm{rem}(2^p, |d|)}{|d|} = \left\lfloor \frac{2^p}{|d|} \right\rfloor + 1 = q + 1.$$

The loop closure test $2^p/|n_c| > \delta$ is awkward to evaluate.
The quantity $2^p/|n_c|$ is available only in the form of a quotient
$q_1$ and a remainder $r_1$. $2^p/|n_c|$ may or may not be an integer (it
is an integer only for $d = 2^{W-2} + 1$ and a few negative values
of d). The test $2^p/|n_c| \le \delta$ can be coded as

q1 < δ | (q1 = δ & r1 = 0).


The complete procedure for computing M and s from d is
shown in Figure 10-1, coded in C, for W = 32. There are a few
places where overflow can occur, but the correct result is
obtained if overflow is ignored.


To use the results of this program, the compiler should
generate the li and mulhs instructions, generate the add if d >
0 and M < 0, or the sub if d < 0 and M > 0, and generate the
shrsi if s > 0. Then, the shri and final add must be generated.


For W = 32, handling a negative divisor can be avoided by
simply returning a precomputed result for d = -3 and d =
-715,827,883, and using m(-d) = -m(d) for other negative
divisors. However, that program would not be significantly
shorter, if at all, than the one given in Figure 10-1.


#include <stdlib.h>        // For abs().

struct ms {int M;          // Magic number
           int s;};        // and shift amount.

struct ms magic(int d) {   // Must have 2 <= d <= 2**31-1
                           // or   -2**31 <= d <= -2.
   int p;
   unsigned ad, anc, delta, q1, r1, q2, r2, t;
   const unsigned two31 = 0x80000000;     // 2**31.
   struct ms mag;

   ad = abs(d);
   t = two31 + ((unsigned)d >> 31);
   anc = t - 1 - t%ad;     // Absolute value of nc.
   p = 31;                 // Init. p.
   q1 = two31/anc;         // Init. q1 = 2**p/|nc|.
   r1 = two31 - q1*anc;    // Init. r1 = rem(2**p, |nc|).
   q2 = two31/ad;          // Init. q2 = 2**p/|d|.
   r2 = two31 - q2*ad;     // Init. r2 = rem(2**p, |d|).
   do {
      p = p + 1;
      q1 = 2*q1;           // Update q1 = 2**p/|nc|.
      r1 = 2*r1;           // Update r1 = rem(2**p, |nc|).
      if (r1 >= anc) {     // (Must be an unsigned
         q1 = q1 + 1;      // comparison here.)
         r1 = r1 - anc;}
      q2 = 2*q2;           // Update q2 = 2**p/|d|.
      r2 = 2*r2;           // Update r2 = rem(2**p, |d|).
      if (r2 >= ad) {      // (Must be an unsigned
         q2 = q2 + 1;      // comparison here.)
         r2 = r2 - ad;}
      delta = ad - r2;
   } while (q1 < delta || (q1 == delta && r1 == 0));

   mag.M = q2 + 1;
   if (d < 0) mag.M = -mag.M; // Magic number and
   mag.s = p - 32;            // shift amount to return.
   return mag;
}


FIGURE 10-1. Computing the magic number for signed 
division. 
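As a usage illustration, the following sketch restates the procedure of Figure 10-1 with fixed-width types and then applies its result the way the generated code would. It is a sketch under stated assumptions: 64-bit arithmetic stands in for mulhs, |d| is formed in unsigned arithmetic instead of with abs, and the names to_signed and divide are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct ms { int32_t M; int s; };

static int32_t mulhs(int32_t a, int32_t b) {    /* floor(a*b / 2**32) */
    int64_t p = (int64_t)a * b, q = p / 4294967296LL;
    if (p % 4294967296LL < 0) q--;
    return (int32_t)q;
}

static int32_t shrs(int32_t x, int k) {         /* floor(x / 2**k) */
    assert(k >= 0 && k <= 30);
    int32_t d = (int32_t)1 << k, q = x / d;
    if (x % d < 0) q--;
    return q;
}

static int32_t to_signed(uint32_t m) {          /* reinterpret m as signed */
    return m < 0x80000000u ? (int32_t)m
                           : (int32_t)((int64_t)m - 4294967296LL);
}

struct ms magic(int32_t d) {                    /* d >= 2 or d <= -2 */
    assert(d >= 2 || d <= -2);
    const uint32_t two31 = 0x80000000u;
    uint32_t ad = d < 0 ? 0u - (uint32_t)d : (uint32_t)d;   /* |d| */
    uint32_t t = two31 + ((uint32_t)d >> 31);
    uint32_t anc = t - 1 - t % ad;              /* |nc| */
    uint32_t q1 = two31 / anc, r1 = two31 - q1 * anc;
    uint32_t q2 = two31 / ad,  r2 = two31 - q2 * ad;
    uint32_t delta;
    int p = 31;
    do {
        p++;
        q1 *= 2; r1 *= 2;                       /* 2**p / |nc| */
        if (r1 >= anc) { q1++; r1 -= anc; }
        q2 *= 2; r2 *= 2;                       /* 2**p / |d|  */
        if (r2 >= ad) { q2++; r2 -= ad; }
        delta = ad - r2;
    } while (q1 < delta || (q1 == delta && r1 == 0));
    uint32_t m = q2 + 1;
    if (d < 0) m = 0u - m;                      /* negate modulo 2**32 */
    struct ms mag = { to_signed(m), p - 32 };
    return mag;
}

/* Apply the magic number: mulhs, optional add or sub, shift, and a
   final add of the sign bit of q (valid for either sign of d). */
int32_t divide(int32_t n, int32_t d, struct ms mag) {
    int32_t q = mulhs(mag.M, n);
    if (d > 0 && mag.M < 0) q += n;
    if (d < 0 && mag.M > 0) q -= n;
    q = shrs(q, mag.s);
    q += (int32_t)((uint32_t)q >> 31);
    return q;
}
```

For d = 7 this sketch produces M = 0x92492493 (as a signed value) and s = 2, and for d = -7 it produces M = 0x6DB6DB6D and s = 2, as in the examples above.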


10-7 Miscellaneous Topics 


THEOREM DC2. The least multiplier m is odd if p is not forced to 
equal W. 


Proof. Assume that Equations (1a) and (1b) are satisfied with 
least (not forced) integer p, and m even. Then clearly m could be 
divided by 2 and p could be decreased by 1, and (1a) and (1b) 
would still be satisfied. This contradicts the assumption that p is 
minimal. 


Uniqueness 


The magic number for a given divisor is sometimes unique (e.g.,
for W = 32, d = 7), but often it is not. In fact, experimentation
suggests that it is usually not unique. For example, for W = 32,
d = 6, there are four magic numbers:

M = 715,827,883      ((2**32 + 2)/6),          s = 0
M = 1,431,655,766    ((2**32 + 2)/3),          s = 1
M = -1,431,655,765   ((2**33 + 1)/3 - 2**32),  s = 2
M = -1,431,655,764   ((2**33 + 4)/3 - 2**32),  s = 2


Nevertheless, there is the following uniqueness property:


THEOREM DC3. For a given divisor d, there is only one multiplier 
m having the minimal value of p, if p is not forced to equal W. 


Proof. First consider the case d > 0. The difference between
the upper and lower limits of inequality (4) is $2^p/(d n_c)$. We have
already proved (7) that if p is minimal, then $2^p/(d n_c) \le 2$.
Therefore, there can be at most two values of m satisfying (4).
Let m be the smaller of these values, given by (5); then m + 1 is
the other.


Let $p_0$ be the least value of p for which m + 1 satisfies the
right half of (4) ($p_0$ is not forced to equal W). Then

$$\frac{2^{p_0} + d - \mathrm{rem}(2^{p_0}, d)}{d} + 1 < \frac{2^{p_0}}{d} \cdot \frac{n_c + 1}{n_c}.$$

This simplifies to

$$2^{p_0} > n_c(2d - \mathrm{rem}(2^{p_0}, d)).$$

Dividing by 2 gives

$$2^{p_0 - 1} > n_c\left(d - \tfrac{1}{2}\mathrm{rem}(2^{p_0}, d)\right).$$

Because $\mathrm{rem}(2^{p_0}, d) \le 2\,\mathrm{rem}(2^{p_0 - 1}, d)$ (by Theorem D5 on
page 184),

$$2^{p_0 - 1} > n_c(d - \mathrm{rem}(2^{p_0 - 1}, d)),$$

contradicting the assumption that $p_0$ is minimal.
The proof for d < 0 is similar and will not be given.
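The four magic numbers listed earlier for W = 32, d = 6 can be checked numerically. The sketch below is only an illustration: it evaluates $\lfloor mn/2^p \rfloor$ directly in 64-bit arithmetic, with p = 32 + s, instead of emulating each instruction, and the name div_by_magic is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Quotient for a positive divisor from a signed magic number M and a
   shift amount s: floor(m*n / 2**(32+s)), plus 1 if n is negative. */
int32_t div_by_magic(int32_t n, int32_t M, int s) {
    assert(s >= 0 && s <= 30);
    /* Recover the multiplier m (0 <= m < 2**32) from M. */
    int64_t m = M < 0 ? (int64_t)M + 4294967296LL : (int64_t)M;
    int64_t prod = m * n;                   /* |prod| < 2**63           */
    int64_t div = (int64_t)1 << (32 + s);   /* 2**p                     */
    int64_t q = prod / div;
    if (prod % div < 0) q--;                /* floor rather than trunc  */
    if (n < 0) q++;                         /* add 1 for negative n     */
    return (int32_t)q;
}
```

Calling div_by_magic(n, M, s) with any of the four (M, s) pairs should agree with n / 6 in C's truncating division.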


The Divisors with the Best Programs 


The program for d = 3, W = 32 is particularly short, because
there is no add or shrsi after the mulhs. What other divisors
have this short program?

We consider only positive divisors. We wish to find integers
m and p that satisfy Equations (1a) and (1b), and for which p =
W and $0 \le m < 2^{W-1}$. Because any integers m and p that satisfy
equations (1a) and (1b) must also satisfy (4), it suffices to find
those divisors d for which (4) has a solution with p = W and
$0 \le m < 2^{W-1}$. All solutions of (4) with p = W are given by

$$m = \frac{2^W + kd - \mathrm{rem}(2^W, d)}{d}, \quad k = 1, 2, 3, \ldots.$$

Combining this with the right half of (4) and simplifying gives

$$\mathrm{rem}(2^W, d) > kd - \frac{2^W}{n_c}. \tag{21}$$

The weakest restriction on $\mathrm{rem}(2^W, d)$ is with k = 1 and $n_c$ at its
minimal value of $2^{W-2}$. Hence, we must have

$$\mathrm{rem}(2^W, d) > d - 4;$$

that is, d divides $2^W + 1$, $2^W + 2$, or $2^W + 3$.
Now let us see which of these factors actually have optimal 
programs. 


If d divides $2^W + 1$, then $\mathrm{rem}(2^W, d) = d - 1$. Then a solution
of (6) is p = W, because the inequality becomes

$$2^W > n_c(d - (d - 1)) = n_c,$$

which is obviously true, because $n_c < 2^{W-1}$. Then in the
calculation of m we have

$$m = \frac{2^W + d - (d - 1)}{d} = \frac{2^W + 1}{d},$$

which is less than $2^{W-1}$ for $d \ge 3$ ($d \ne 2$ because d divides
$2^W + 1$). Hence, all the factors of $2^W + 1$ have optimal programs.
Similarly, if d divides $2^W + 2$, then $\mathrm{rem}(2^W, d) = d - 2$.
Again, a solution of (6) is p = W, because the inequality
becomes

$$2^W > n_c(d - (d - 2)) = 2n_c,$$

which is obviously true. Then in the calculation of m we have

$$m = \frac{2^W + d - (d - 2)}{d} = \frac{2^W + 2}{d},$$

which exceeds $2^{W-1} - 1$ for d = 2, but which is less than or
equal to $2^{W-1} - 1$ for $W \ge 3$, $d \ge 3$ (the case W = 3 and d = 3
does not occur, because 3 is not a factor of $2^3 + 2 = 10$). Hence
all factors of $2^W + 2$, except for 2 and the cofactor of 2, have
optimal programs. (The cofactor of 2 is $(2^W + 2)/2$, which is
not representable as a W-bit signed integer).


If d divides $2^W + 3$, the following argument shows that d
does not have an optimal program. Because $\mathrm{rem}(2^W, d) = d - 3$,
inequality (21) implies that we must have

$$n_c < \frac{2^W}{kd - d + 3}$$

for some k = 1, 2, 3, .... The weakest restriction is with k = 1,
so we must have $n_c < 2^W/3$.


From (3a), $n_c \ge 2^{W-1} - d$, or $d \ge 2^{W-1} - n_c$. Hence, it is
necessary that

$$d > 2^{W-1} - \frac{2^W}{3} = \frac{2^W}{6}.$$

Also, because 2, 3, and 4 do not divide $2^W + 3$, the smallest
possible factor of $2^W + 3$ is 5. Therefore, the largest possible
factor is $(2^W + 3)/5$. Thus, if d divides $2^W + 3$ and d has an
optimal program, it is necessary that

$$\frac{2^W}{6} < d \le \frac{2^W + 3}{5}.$$

Taking reciprocals of this with respect to $2^W + 3$ shows that the
cofactor of d, $(2^W + 3)/d$, has the limits

$$5 \le \frac{2^W + 3}{d} < \frac{(2^W + 3) \cdot 6}{2^W} = 6 + \frac{18}{2^W}.$$

For $W \ge 5$, this implies that the only possible cofactors are 5
and 6. For W < 5, it is easily verified that there are no factors of
$2^W + 3$. Because 6 cannot be a factor of $2^W + 3$, the only
possibility is 5. Therefore, the only possible factor of $2^W + 3$
that might have an optimal program is $(2^W + 3)/5$.


For $d = (2^W + 3)/5$,

$$n_c = \left\lfloor \frac{2^{W-1}}{(2^W + 3)/5} \right\rfloor \frac{2^W + 3}{5} - 1.$$

For $W \ge 4$,

$$2 < \frac{2^{W-1}}{(2^W + 3)/5} < 2.5,$$

so

$$n_c = 2 \cdot \frac{2^W + 3}{5} - 1.$$

This exceeds $2^W/3$, so $d = (2^W + 3)/5$ does not have an
optimal program. Because for W < 4 there are no factors of
$2^W + 3$, we conclude that no factors of $2^W + 3$ have optimal
programs.

In summary, all the factors of $2^W + 1$ and of $2^W + 2$, except
for 2 and $(2^W + 2)/2$, have optimal programs, and no other
numbers do. Furthermore, the above proof shows that algorithm
magic (Figure 10-1 on page 223) always produces the optimal
program when it exists.

Let us consider the specific cases W = 16, 32, and 64. The
relevant factorizations are shown below.

$2^{16} + 1 = 65537$ (prime)
$2^{16} + 2 = 2 \cdot 3^2 \cdot 11 \cdot 331$
$2^{32} + 1 = 641 \cdot 6{,}700{,}417$
$2^{32} + 2 = 2 \cdot 3 \cdot 715{,}827{,}883$
$2^{64} + 1 = 274{,}177 \cdot 67{,}280{,}421{,}310{,}721$
$2^{64} + 2 = 2 \cdot 3^3 \cdot 19 \cdot 43 \cdot 5419 \cdot 77{,}158{,}673{,}929$

The result for W = 16 is that there are 20 divisors that have
optimal programs. The ones less than 100 are 3, 6, 9, 11, 18, 22,
33, 66, and 99.


For W = 32, there are six such divisors: 3, 6, 641, 6,700,417,
715,827,883, and 1,431,655,766.


For W = 64, there are 126 such divisors. The ones less than
100 are 3, 6, 9, 18, 19, 27, 38, 43, 54, 57, and 86.
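The count for W = 16 is small enough to confirm by brute force. The sketch below assumes that "optimal program" means (6) is satisfied at p = W and the multiplier from (5) is below $2^{W-1}$, as in the derivation above; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if d has the short program for W = 16: inequality (6) holds
   at p = W, and m computed from (5) is less than 2**(W-1). */
int has_optimal_program16(uint32_t d) {
    assert(d >= 2 && d <= 32767);
    const uint32_t two15 = 1u << 15, two16 = 1u << 16;
    uint32_t nc = two15 - two15 % d - 1;            /* n_c            */
    uint32_t rem = two16 % d;                       /* rem(2**W, d)   */
    if (!(two16 > (uint64_t)nc * (d - rem)))        /* (6) at p = W   */
        return 0;
    uint32_t m = (two16 + d - rem) / d;             /* (5)            */
    return m < two15;
}

/* Count the divisors 2 <= d < 2**15 that have the short program. */
int count_optimal16(void) {
    int count = 0;
    for (uint32_t d = 2; d <= 32767; d++)
        count += has_optimal_program16(d);
    return count;
}
```

Under these assumptions count_optimal16() should return 20, and the qualifying divisors below 100 should be exactly those listed above.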


10-8 Unsigned Division 


Unsigned division by a power of 2 is, of course, implemented by 
a single shift right logical instruction, and remainder by and 
immediate. 


It might seem that handling other divisors will be simple: 
Just use the results for signed division with d > 0, omitting the 
two instructions that add 1 if the quotient is negative. We will 
see, however, that some of the details are actually more 
complicated in the case of unsigned division. 


Unsigned Division by 3 


For a non-power of 2, let us first consider unsigned division by 3
on a 32-bit machine. Because the dividend n can now be as large
as $2^{32} - 1$, the multiplier $(2^{32} + 2)/3$ is inadequate, because
the error term $2n/(3 \cdot 2^{32})$ (see "divide by 3" example above)
can exceed 1/3. However, the multiplier $(2^{33} + 1)/3$ is
adequate. The code is


li     M,0xAAAAAAAB   Load magic number, (2**33+1)/3.
mulhu  q,M,n          q = floor(M*n/2**32).
shri   q,q,1

muli   t,q,3          Compute remainder from
sub    r,n,t          r = n - q*3.


An instruction that gives the high-order 32 bits of a 64-bit
unsigned product is required, which we show above as mulhu.
To see that the code is correct, observe that it computes

$$q = \left\lfloor \frac{2^{33} + 1}{3} \cdot \frac{n}{2^{33}} \right\rfloor = \left\lfloor \frac{n}{3} + \frac{n}{3 \cdot 2^{33}} \right\rfloor.$$

For $0 \le n < 2^{32}$, $0 \le n/(3 \cdot 2^{33}) < 1/3$, so by Theorem D4,
$q = \lfloor n/3 \rfloor$.


In computing the remainder, the multiply immediate can 
overflow if we regard the operands as signed integers, but it 
does not overflow if we regard them and the result as unsigned. 
Also, the subtract cannot overflow, because the result is in the 
range 0 to 2, so the remainder is correct. 
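In C the same computation can be sketched with a 64-bit product standing in for mulhu. The names mulhu, divu3, and remu3 are illustrative and are not taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the 64-bit unsigned product. */
static uint32_t mulhu(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Unsigned n / 3: multiply by (2**33+1)/3 and shift right 33 in all. */
uint32_t divu3(uint32_t n) {
    uint32_t q = mulhu(0xAAAAAAABu, n);     /* floor(M*n / 2**32) */
    return q >> 1;
}

/* Remainder from r = n - q*3, computed in unsigned arithmetic. */
uint32_t remu3(uint32_t n) {
    uint32_t r = n - divu3(n) * 3u;
    assert(r <= 2);                         /* result is in 0..2  */
    return r;
}
```

For the largest dividend, divu3(0xFFFFFFFF) gives 0x55555555 and remu3(0xFFFFFFFF) gives 0.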


Unsigned Division by 7 


For unsigned division by 7 on a 32-bit machine, the multipliers
$(2^{32} + 3)/7$, $(2^{33} + 6)/7$, and $(2^{34} + 5)/7$ are all
inadequate, because they give too large an error term. The
multiplier $(2^{35} + 3)/7$ is acceptable, but it's too large to
represent in a 32-bit unsigned word. We can multiply by this
large number by multiplying by $(2^{35} + 3)/7 - 2^{32}$ and then
correcting the product by inserting an add. The code is


li     M,0x24924925   Magic num, (2**35+3)/7 - 2**32.
mulhu  q,M,n          q = floor(M*n/2**32).
add    q,q,n          Can overflow (sets carry).
shrxi  q,q,3          Shift right with carry bit.

muli   t,q,7          Compute remainder from
sub    r,n,t          r = n - q*7.


Here we have a problem: The add can overflow. To allow for
this, we have invented the new instruction shift right extended
immediate (shrxi), which treats the carry from the add and the
32 bits of register q as a single 33-bit quantity, and shifts it right
with 0-fill. On the Motorola 68000 family, this can be done with
two instructions: rotate with extend right one position, followed
by a logical right shift of three (roxr actually uses the X bit, but
the add sets the X bit the same as the carry bit). On most
machines, it will take more. For example, on PowerPC it takes
three instructions: clear rightmost three bits of q, add carry to q,
and rotate right three positions.


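With shrxi implemented somehow, the code above computes

$$q = \left\lfloor \left( \left\lfloor \left( \frac{2^{35} + 3}{7} - 2^{32} \right) \frac{n}{2^{32}} \right\rfloor + n \right) \Big/ 2^3 \right\rfloor = \left\lfloor \frac{n}{7} + \frac{3n}{7 \cdot 2^{35}} \right\rfloor.$$

For $0 \le n < 2^{32}$, $0 \le 3n/(7 \cdot 2^{35}) < 1/7$, so by Theorem D4,
$q = \lfloor n/7 \rfloor$.

Granlund and Montgomery [GM] have a clever scheme for
avoiding the shrxi instruction. It requires the same number of
instructions as the above three-instruction sequence for shrxi,
but it employs only elementary instructions that almost any
machine would have, and it does not cause overflow at all. It
uses the identity

$$\left\lfloor \frac{q + n}{2^p} \right\rfloor = \left\lfloor \frac{\lfloor (n - q)/2 \rfloor + q}{2^{p-1}} \right\rfloor, \quad p \ge 1.$$

Applying this to our problem, with $q = \lfloor Mn/2^{32} \rfloor$ where
$0 \le M < 2^{32}$, the subtraction will not overflow, because

$$0 \le q = \left\lfloor \frac{Mn}{2^{32}} \right\rfloor \le n,$$

so that, clearly, $0 \le n - q < 2^{32}$. Also, the addition will not
overflow, because

$$\left\lfloor \frac{n - q}{2} \right\rfloor + q = \left\lfloor \frac{n - q}{2} + q \right\rfloor = \left\lfloor \frac{n + q}{2} \right\rfloor$$

and $0 \le n, q < 2^{32}$.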


Using this idea gives the following code for unsigned division 
by 7: 


li     M,0x24924925   Magic num, (2**35+3)/7 - 2**32.
mulhu  q,M,n          q = floor(M*n/2**32).
sub    t,n,q          t = n - q.
shri   t,t,1          t = (n - q)/2.
add    t,t,q          t = (n - q)/2 + q = (n + q)/2.
shri   q,t,2          q = (n+Mn/2**32)/8 = floor(n/7).

muli   t,q,7          Compute remainder from
sub    r,n,t          r = n - q*7.


For this to work, the shift amount for the hypothetical shrxi
instruction must be greater than 0. It can be shown that if d > 1
and the multiplier $m \ge 2^{32}$ (so that the shrxi instruction is
needed), then the shift amount is greater than 0.
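The two variants of unsigned division by 7 can be compared in C. In this sketch the hypothetical shrxi is modeled by widening the sum to 64 bits, while the second function uses only 32-bit operations as in the Granlund and Montgomery sequence; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the 64-bit unsigned product. */
static uint32_t mulhu(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Variant with the add that can carry: the 33-bit sum q + n is held
   in a 64-bit variable, which plays the role of shrxi. */
uint32_t divu7_wide(uint32_t n) {
    uint32_t q = mulhu(0x24924925u, n);     /* (2**35+3)/7 - 2**32 */
    uint64_t sum = (uint64_t)q + n;         /* may exceed 32 bits  */
    return (uint32_t)(sum >> 3);
}

/* Variant with no overflow, using 32-bit operations only. */
uint32_t divu7(uint32_t n) {
    uint32_t q = mulhu(0x24924925u, n);
    assert(q <= n);                         /* so n - q cannot wrap */
    uint32_t t = n - q;                     /* t = n - q            */
    t >>= 1;                                /* t = (n - q)/2        */
    t += q;                                 /* t = (n + q)/2        */
    return t >> 2;                          /* floor(n/7)           */
}
```

Both functions should return the same quotient for every 32-bit n, for example 613566756 for n = 0xFFFFFFFF.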


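10-9 Unsigned Division by Divisors ≥ 1


Given a word size $W \ge 1$ and a divisor d, $1 \le d < 2^W$, we wish
to find the least integer m and integer p such that

$$\left\lfloor \frac{mn}{2^p} \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor \quad \text{for } 0 \le n < 2^W, \tag{22}$$

with $0 \le m < 2^{W+1}$ and $p \ge W$.
In the unsigned case, the magic number M is given by

$$M = \begin{cases} m, & \text{if } 0 \le m < 2^W, \\ m - 2^W, & \text{if } 2^W \le m < 2^{W+1}. \end{cases}$$

Because (22) must hold for n = d, $\lfloor md/2^p \rfloor = 1$, or

$$\frac{md}{2^p} \ge 1. \tag{23}$$

As in the signed case, let $n_c$ be the largest value of n such that
$\mathrm{rem}(n_c, d) = d - 1$. It can be calculated from
$n_c = \lfloor 2^W/d \rfloor d - 1 = 2^W - \mathrm{rem}(2^W, d) - 1$. Then

$$2^W - d \le n_c \le 2^W - 1, \tag{24a}$$

and

$$n_c \ge d - 1. \tag{24b}$$

These imply that $n_c \ge 2^{W-1}$.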
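Because (22) must hold for $n = n_c$,

$$\left\lfloor \frac{m n_c}{2^p} \right\rfloor = \left\lfloor \frac{n_c}{d} \right\rfloor = \frac{n_c - (d - 1)}{d},$$

or

$$\frac{m n_c}{2^p} < \frac{n_c + 1}{d}.$$

Combining this with (23) gives

$$\frac{2^p}{d} \le m < \frac{2^p}{d} \cdot \frac{n_c + 1}{n_c}. \tag{25}$$

Because m is to be the least integer satisfying (25), it is the
next integer greater than or equal to $2^p/d$; that is,

$$m = \frac{2^p + d - 1 - \mathrm{rem}(2^p - 1, d)}{d}. \tag{26}$$

Combining this with the right half of (25) and simplifying gives

$$2^p > n_c(d - 1 - \mathrm{rem}(2^p - 1, d)). \tag{27}$$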


The Algorithm (Unsigned) 


Thus, the algorithm is to find by trial and error the least $p \ge W$
satisfying (27). Then, m is calculated from (26). This is the
smallest possible value of m satisfying (22) with $p \ge W$. As in
the signed case, if (27) is true for some value of p, then it is true
for all larger values of p. The proof is essentially the same as that
of Theorem DC1, except Theorem D5(b) is used instead of
Theorem D5(a).
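For a small word size the trial-and-error search can be run directly and the result checked exhaustively. The sketch below takes W = 16 so that every quantity, including $2^p$ for p up to 2W, fits in 64 bits; the names find_multiplier_u16, verify_u16, and struct mpu are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplier m (0 <= m < 2**17) and exponent p for W = 16. */
struct mpu { uint64_t m; int p; };

/* Least p >= 16 satisfying (27), and m from (26).
   Requires 1 <= d <= 2**16 - 1. */
struct mpu find_multiplier_u16(uint32_t d) {
    assert(d >= 1 && d <= 0xFFFFu);
    const uint64_t two16 = (uint64_t)1 << 16;
    uint64_t nc = two16 - two16 % d - 1;            /* n_c              */
    struct mpu r;
    for (r.p = 16; ; r.p++) {
        uint64_t pow2 = (uint64_t)1 << r.p;         /* 2**p             */
        uint64_t rem = (pow2 - 1) % d;              /* rem(2**p - 1, d) */
        if (pow2 > nc * (d - 1 - rem)) {            /* inequality (27)  */
            r.m = (pow2 + d - 1 - rem) / d;         /* equation (26)    */
            return r;
        }
    }
}

/* Nonzero if floor(m*n / 2**p) equals floor(n/d) for every 16-bit n. */
int verify_u16(uint32_t d) {
    struct mpu r = find_multiplier_u16(d);
    for (uint64_t n = 0; n <= 0xFFFFu; n++)
        if (((r.m * n) >> r.p) != n / d) return 0;
    return 1;
}
```

For d = 7 the search gives m = 74899 and p = 19; because m is at least $2^{16}$, the magic number is M = m - $2^{16}$ = 0x2493 and the add is required, mirroring the 32-bit example.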


Proof That the Algorithm Is Feasible (Unsigned) 


We must show that (27) always has a solution and that
$0 \le m < 2^{W+1}$.


Because for any nonnegative integer x there is a power of 2
greater than x and less than or equal to 2x + 1, from (27),

$$n_c(d - 1 - \mathrm{rem}(2^p - 1, d)) < 2^p \le 2n_c(d - 1 - \mathrm{rem}(2^p - 1, d)) + 1.$$

Because $0 \le \mathrm{rem}(2^p - 1, d) \le d - 1$,

$$1 \le 2^p \le 2n_c(d - 1) + 1. \tag{28}$$

Because $n_c, d \le 2^W - 1$, this becomes

$$1 \le 2^p \le 2(2^W - 1)(2^W - 2) + 1,$$

or

$$0 \le p \le 2W. \tag{29}$$

Thus, (27) always has a solution.
If p is not forced to equal W, then from (25) and (28),

$$\frac{1}{d} \le m < \frac{2n_c(d - 1) + 1}{d} \cdot \frac{n_c + 1}{n_c},$$

$$1 \le m < \frac{2d - 2 + 1/n_c}{d}(n_c + 1),$$

$$1 \le m < 2(n_c + 1) \le 2^{W+1}.$$


If p is forced to equal W, then from (25),

$$\frac{2^W}{d} \le m < \frac{2^W}{d} \cdot \frac{n_c + 1}{n_c}.$$

Because $1 \le d \le 2^W - 1$ and $n_c \ge 2^{W-1}$,

$$\frac{2^W}{2^W - 1} \le m < \frac{2^W}{1} \cdot \frac{2^{W-1} + 1}{2^{W-1}},$$

$$2 \le m \le 2^W + 1.$$

In either case m is within limits for the code schema illustrated
by the "unsigned divide by 7" example.


Proof That the Product Is Correct (Unsigned) 


We must show that if p and m are calculated from (27) and (26),
then (22) is satisfied.


Equation (26) and inequality (27) are easily seen to imply
(25). Inequality (25) is nearly the same as (4), and the
remainder of the proof is nearly identical to that for signed
division with $n \ge 0$.


10-10 Incorporation into a Compiler (Unsigned) 


There is a difficulty in implementing an algorithm based on
direct evaluation of the expressions used in this proof. Although
$p \le 2W$, which is proved above, the case $p = 2W$ can occur
(e.g., for $d = 2^W - 2$ with $W \ge 4$). When $p = 2W$, it is difficult
to calculate m, because the dividend in (26) does not fit in a
2W-bit word.


However, it can be implemented by the "incremental division
and remainder" technique of algorithm magic. The algorithm is
given in Figure 10-2 for W = 32. It passes back an indicator a,
which tells whether or not to generate an add instruction. (In
the case of signed division, the caller recognizes this by M and d
having opposite signs.)

Some key points in understanding this algorithm are as 
follows: 


• Unsigned overflow can occur at several places and should
be ignored.

• $n_c = 2^W - \mathrm{rem}(2^W, d) - 1 = (2^W - 1) - \mathrm{rem}(2^W - d, d)$.

• The quotient and remainder of dividing $2^p$ by $n_c$ cannot
be updated in the same way as is done in algorithm
magic, because here the quantity 2*r1 can overflow.
Hence, the algorithm has the test "if (r1 >= nc - r1),"
whereas "if (2*r1 >= nc)" would be more natural.
A similar remark applies to computing the quotient and
remainder of $2^p - 1$ divided by d.

• $0 \le \delta \le d - 1$, so $\delta$ is representable as a 32-bit unsigned
integer.




struct mu {unsigned M;     // Magic number,
          int a;           // "add" indicator,
          int s;};         // and shift amount.

struct mu magicu(unsigned d) {
                           // Must have 1 <= d <= 2**32-1.
   int p;
   unsigned nc, delta, q1, r1, q2, r2;
   struct mu magu;

   magu.a = 0;             // Initialize "add" indicator.
   nc = -1 - (-d)%d;       // Unsigned arithmetic here.
   p = 31;                 // Init. p.
   q1 = 0x80000000/nc;     // Init. q1 = 2**p/nc.
   r1 = 0x80000000 - q1*nc;// Init. r1 = rem(2**p, nc).
   q2 = 0x7FFFFFFF/d;      // Init. q2 = (2**p - 1)/d.
   r2 = 0x7FFFFFFF - q2*d; // Init. r2 = rem(2**p - 1, d).
   do {
      p = p + 1;
      if (r1 >= nc - r1) {
         q1 = 2*q1 + 1;            // Update q1.
         r1 = 2*r1 - nc;}          // Update r1.
      else {
         q1 = 2*q1;
         r1 = 2*r1;}
      if (r2 + 1 >= d - r2) {
         if (q2 >= 0x7FFFFFFF) magu.a = 1;
         q2 = 2*q2 + 1;            // Update q2.
         r2 = 2*r2 + 1 - d;}       // Update r2.
      else {
         if (q2 >= 0x80000000) magu.a = 1;
         q2 = 2*q2;
         r2 = 2*r2 + 1;}
      delta = d - 1 - r2;
   } while (p < 64 &&
           (q1 < delta || (q1 == delta && r1 == 0)));

   magu.M = q2 + 1;        // Magic number
   magu.s = p - 32;        // and shift amount to return
   return magu;            // (magu.a was set above).
}


FIGURE 10-2. Computing the magic number for unsigned 
division. 


• $m = (2^p + d - 1 - \mathrm{rem}(2^p - 1, d))/d = \lfloor (2^p - 1)/d \rfloor + 1$
= q2 + 1.

• The subtraction of $2^W$ when the magic number M exceeds
$2^W - 1$ is not explicit in the program; it occurs if the
computation of q2 overflows.

• The "add" indicator, magu.a, cannot be set by a
straightforward comparison of M to $2^{32}$, or of q2 to $2^{32} - 1$,
because of overflow. Instead, the program tests q2
before overflow can occur. If q2 ever gets as large as
$2^{32} - 1$, so that M will be greater than or equal to $2^{32}$, then
magu.a is set equal to 1. If q2 stays below $2^{32} - 1$, then
magu.a is left at its initial value of 0.

• Inequality (27) is equivalent to $2^p / n_c > \delta$.

• The loop test needs the condition p < 64, because
without it, overflow of q1 would cause the program to
loop too many times, giving incorrect results.


To use the results of this program, the compiler should
generate the li and mulhu instructions and, if the "add"
indicator a = 0, generate the shri of s (if s > 0), as
illustrated by the example of "Unsigned Division by 3," on page
227. If a = 1 and the machine has the shrxi instruction, the
compiler should generate the add and shrxi of s as illustrated
by the example of "Unsigned Division by 7," on page 228. If a
= 1 and the machine does not have the shrxi instruction, use
the example on page 229: generate the sub, the shri of 1, the
add, and finally the shri of s - 1 (if s - 1 > 0; s will not be 0
at this point except in the trivial case of division by 1, which we
assume the compiler deletes).
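
The code shapes just described can be modeled in C. This is a
minimal sketch, assuming 32-bit words and using a 64-bit product to
stand in for mulhu; the function names are hypothetical, and the
a = 1 branch follows the four-instruction sequence (sub, shri of 1,
add, shri of s - 1), so it assumes s >= 1, which excludes d = 1.

```c
#include <stdint.h>

/* Models mulhu: high-order 32 bits of the unsigned 64-bit product. */
static uint32_t mulhu(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Divide n by the constant whose magic triple is (M, add, s).
   Assumes s >= 1 whenever add is nonzero. */
static uint32_t udiv_by_magic(uint32_t n, uint32_t M, int add, int s) {
    uint32_t q = mulhu(M, n);
    if (!add)
        return q >> s;                /* li, mulhu, shri            */
    uint32_t t = ((n - q) >> 1) + q;  /* sub, shri 1, add; no carry */
    return t >> (s - 1);              /* shri s - 1                 */
}
```

For example, udiv_by_magic(n, 0xAAAAAAAB, 0, 1) divides by 3 and
udiv_by_magic(n, 0x24924925, 1, 3) divides by 7.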


10-11 Miscellaneous Topics (Unsigned) 


THEOREM DC2U. The least multiplier m is odd if p is not forced to
equal W.


THEOREM DC3U. For a given divisor d, there is only one 
multiplier m having the minimal value of p, if p is not forced to equal 
W. 


The proofs of these theorems follow very closely the 
corresponding proofs for signed division. 


The Divisors with the Best Programs (Unsigned) 


For unsigned division, to find the divisors (if any) with optimal
programs of two instructions to obtain the quotient (li, mulhu),
we can do an analysis similar to that of the signed case (see "The
Divisors with the Best Programs" on page 225). The result is that
such divisors are the factors of $2^W$ or $2^W + 1$, except for d = 1.
For the common word sizes, this leaves very few nontrivial
divisors that have optimal programs for unsigned division. For
W = 16, there are none. For W = 32, there are only two: 641
and 6,700,417. For W = 64, again there are only two: 274,177
and 67,280,421,310,721.


The case $d = 2^k$, k = 1, 2, ..., deserves special mention. In
this case, algorithm magicu produces p = W (forced), $m = 2^{32-k}$.
This is the minimal value of m, but it is not the minimal value
of M. Better code results if p = W + k is used, if sufficient
simplifications are done. Then, $m = 2^W$, M = 0, a = 1, and s =
k. The generated code involves a multiplication by 0 and can be
simplified to a single shift right k instruction. As a practical
matter, divisors that are a power of 2 would probably be
special-cased without using magicu. (This phenomenon does not occur
for signed division, because for signed division m cannot be a
power of 2. Proof: For d > 0, inequality (4) combined with (3b)
implies that $d - 1 < 2^p/m < d$. Therefore, $2^p/m$ cannot be an
integer. For d < 0, the result follows similarly from (16)
combined with (15b).)


For unsigned division, the code for the case $m \ge 2^W$ is
considerably worse than the code for the case $m < 2^W$ if the
machine does not have shrxi. It is of interest to have some idea
of how often the large multipliers arise. For W = 32, among the
integers less than or equal to 100, there are 31 "bad" divisors: 1,
7, 14, 19, 21, 27, 28, 31, 35, 37, 38, 39, 42, 45, 53, 54, 55, 56,
57, 62, 63, 70, 73, 74, 76, 78, 84, 90, 91, 95, and 97.


Using Signed in Place of Unsigned Multiply, and the Reverse 


If your machine does not have mulhu, but it does have mulhs (or
signed long multiplication), the trick given in "High-Order
Product Signed from/to Unsigned," on page 174, might make
our method of doing unsigned division by a constant still useful.


That section gives a seven-instruction sequence for getting
mulhu from mulhs. However, for this application it simplifies,
because the magic number M is known. Thus, the compiler can
test the most significant bit of the magic number, and generate
code such as the following for the operation "mulhu q,M,n." Here
t denotes a temporary register.




M31 = 0                 M31 = 1
mulhs q,M,n             mulhs q,M,n
shrsi t,n,31            shrsi t,n,31
and   t,t,M             and   t,t,M
add   q,q,t             add   t,t,n
                        add   q,q,t


Accounting for the other instructions used with mulhu, this
uses a total of six to eight instructions to obtain the quotient of
unsigned division by a constant on a machine that does not have
unsigned multiply.


This trick can be inverted, to get mulhs in terms of mulhu.
The code is the same as that above, except the mulhs is changed
to mulhu and the final add in each column is changed to sub.
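
The correction terms can be checked with a small C model. This is a
minimal sketch under stated assumptions: 32-bit words, two's-complement
conversion from unsigned to signed, and an arithmetic right shift of
negative 64-bit values (both are implementation-defined in C but hold
on common compilers). The function names are hypothetical.

```c
#include <stdint.h>

/* Models mulhs: high-order 32 bits of the signed 64-bit product.
   Assumes >> on a negative int64_t is an arithmetic shift. */
static int32_t mulhs(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Unsigned high-order product obtained from the signed one. */
static uint32_t mulhu_via_mulhs(uint32_t M, uint32_t n) {
    uint32_t q = (uint32_t)mulhs((int32_t)M, (int32_t)n);
    uint32_t t = (n >> 31) ? M : 0;   /* shrsi t,n,31 ; and t,t,M      */
    if (M >> 31)
        t += n;                       /* add t,t,n, only when M31 = 1  */
    return q + t;                     /* add q,q,t                     */
}
```

A compiler knows M, so the test of its high bit happens at compile
time; the run-time test in this sketch only keeps one function for
both columns.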


A Simpler Algorithm (Unsigned) 


Dropping the requirement that the magic number be minimal
yields a simpler algorithm. In place of (27) we can use

$$2^p \ge 2^W (d - 1 - \mathrm{rem}(2^p - 1, d)), \tag{30}$$

and then use (26) to compute m, as before.


It should be clear that this algorithm is formally correct (that
is, that the value of m computed does satisfy Equation (22)),
because its only difference from the previous algorithm is that it
computes a value of p that, for some values of d, is unnecessarily
large. It can be proved that the value of m computed from (30)
and (26) is less than $2^{W+1}$. We omit the proof and simply give
the algorithm (Figure 10-3).




struct mu {unsigned M;     // Magic number,
          int a;           // "add" indicator,
          int s;};         // and shift amount.

struct mu magicu2(unsigned d) {
                           // Must have 1 <= d <= 2**32-1.
   int p;
   unsigned p32, q, r, delta;
   struct mu magu;

   magu.a = 0;             // Initialize "add" indicator.
   p = 31;                 // Initialize p.
   q = 0x7FFFFFFF/d;       // Initialize q = (2**p - 1)/d.
   r = 0x7FFFFFFF - q*d;   // Init. r = rem(2**p - 1, d).
   do {
      p = p + 1;
      if (p == 32) p32 = 1;     // Set p32 = 2**(p-32).
      else p32 = 2*p32;
      if (r + 1 >= d - r) {
         if (q >= 0x7FFFFFFF) magu.a = 1;
         q = 2*q + 1;           // Update q.
         r = 2*r + 1 - d;       // Update r.
      }
      else {
         if (q >= 0x80000000) magu.a = 1;
         q = 2*q;
         r = 2*r + 1;
      }
      delta = d - 1 - r;
   } while (p < 64 && p32 < delta);
   magu.M = q + 1;         // Magic number and
   magu.s = p - 32;        // shift amount to return
   return magu;            // (magu.a was set above).
}


FIGURE 10-3. Simplified algorithm for computing the magic 
number, unsigned division. 


Alverson [Alv] gives a much simpler algorithm, discussed in
the next section, but it gives somewhat large values for m. The
point of algorithm magicu2 is that it nearly always gives the
minimal value for m when $d \le 2^{W-1}$. For W = 32, the smallest
divisor for which magicu2 does not give the minimal multiplier
is d = 102,807, for which magicu calculates m = 2,737,896,999
and magicu2 calculates m = 5,475,793,997.


There is an analog of magicu2 for signed division by positive 
divisors, but it does not work out very well for signed division 
by arbitrary divisors. 


10-12 Applicability to Modulus and Floor Division 


It might seem that turning modulus or floor division by a
constant into multiplication would be simpler, in that the "add 1
if the dividend is negative" step could be omitted. This is not the
case. The methods given above do not apply in any obvious way
to modulus and floor division. Perhaps something could be
worked out; it might involve altering the multiplier m slightly,
depending upon the sign of the dividend.


10-13 Similar Methods 


Rather than coding algorithm magic, we can provide a table that
gives the magic numbers and shift amounts for a few small
divisors. Divisors equal to the tabulated ones multiplied by a
power of 2 are easily handled as follows:


1. Count the number of trailing 0's in d, and let this be
denoted by k.
2. Use as the lookup argument $d/2^k$ (shift right k).
3. Use the magic number found in the table.
4. Use the shift amount found in the table, increased by k.


Thus, if the table contains the divisors 3, 5, 25, and so on, 
divisors of 6, 10, 100, and so forth can be handled. 
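
The four steps can be sketched in C as follows. This is only an
illustration under stated assumptions: W = 32, positive divisors, a
three-entry table holding the signed magic numbers for 3, 5, and 25,
and an arithmetic right shift of negative signed values. The names
magic_s, lookup_magic, and sdiv_by_magic are hypothetical.

```c
#include <stdint.h>

struct magic_s { int32_t M; int s; };

/* Tiny table of signed magic numbers for W = 32 (odd divisors only). */
static const struct { uint32_t d; int32_t M; int s; } magic_table[] = {
    { 3, 0x55555556, 0 },
    { 5, 0x66666667, 1 },
    {25, 0x51EB851F, 3 },
};

/* Steps 1-4: strip trailing zeros, look up the odd part, add k to s.
   Returns 1 on success, 0 if the odd part is not tabulated. */
static int lookup_magic(uint32_t d, struct magic_s *out) {
    int k = 0;
    if (d == 0) return 0;
    while ((d & 1) == 0) { d >>= 1; k++; }        /* step 1 and step 2 */
    for (unsigned i = 0; i < sizeof magic_table / sizeof magic_table[0]; i++)
        if (magic_table[i].d == d) {
            out->M = magic_table[i].M;            /* step 3 */
            out->s = magic_table[i].s + k;        /* step 4 */
            return 1;
        }
    return 0;
}

/* Signed division for a positive divisor whose M is positive:
   mulhs, shrsi by s, then add 1 if the dividend is negative. */
static int32_t sdiv_by_magic(int32_t n, struct magic_s m) {
    int32_t q = (int32_t)(((int64_t)m.M * n) >> 32);
    q >>= m.s;
    return q + (int32_t)((uint32_t)n >> 31);
}
```

With this table, a divisor of 100 is looked up as 25 with k = 2,
giving the magic number 0x51EB851F and a shift amount of 5.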


This procedure usually gives the smallest magic number, but
not always. The smallest positive divisor for which it fails in this
respect for W = 32 is d = 334,972, for which it computes m =
3,361,176,179 and s = 18. However, the minimal magic
number for d = 334,972 is m = 840,294,045, with s = 16. The
procedure also fails to give the minimal magic number for d = -6.
In both these cases, output code quality is affected.


Alverson [Alv] is the first known to the author to state that
the method described here works with complete accuracy for all
divisors. Using our notation, his method for unsigned integer
division by d is to set the shift amount $p = W + \lceil \log_2 d \rceil$ and
the multiplier $m = \lceil 2^p/d \rceil$, and then do the division by
$n \div d = \lfloor mn/2^p \rfloor$ (that is, multiply and shift right). He
proves that the multiplier m is less than $2^{W+1}$ and that the method
gets the exact quotient for all n expressible in W bits.
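
As an illustration of this recipe, here is a minimal sketch for
W = 16, chosen so that $2^p$ and the product mn fit comfortably in
64-bit arithmetic. The function names are hypothetical, and the code
models the arithmetic only, not Alverson's hardware design.

```c
#include <stdint.h>

/* Smallest k with 2^k >= d, for d >= 1. */
static unsigned ceil_log2(uint32_t d) {
    unsigned k = 0;
    while (((uint32_t)1 << k) < d) k++;
    return k;
}

/* p = W + ceil(log2 d), m = ceil(2^p / d), for W = 16 and 1 <= d <= 65535. */
static void alverson16(uint16_t d, uint32_t *m, unsigned *p) {
    *p = 16 + ceil_log2(d);
    uint64_t two_p = (uint64_t)1 << *p;
    *m = (uint32_t)((two_p + d - 1) / d);   /* ceiling division */
}

/* Quotient by multiply and shift right. */
static uint16_t div16(uint16_t n, uint32_t m, unsigned p) {
    return (uint16_t)(((uint64_t)m * n) >> p);
}
```

No search over p is needed, which is the simplification at the price
of a multiplier that is never smaller than $2^W$.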


Alverson's method is a simpler variation of ours in that it
doesn't require trial and error to determine p, and is therefore
more suitable for building in hardware, which is his primary
interest. His multiplier m is always greater than or equal to $2^W$,
and hence for the software application always gives the code
illustrated by the "unsigned divide by 7" example (that is,
always has the add and shrxi, or the alternative four
instructions). Because most small divisors can be handled with a
multiplier less than $2^W$, it seems worthwhile to look for these
cases.


For signed division, Alverson suggests finding the multiplier
for |d| and a word length of W - 1 (then $2^{W-1} \le m < 2^W$),
multiplying the dividend by it, and negating the result if the
operands have opposite signs. (The multiplier must be such that
it gives the correct result when the dividend is $2^{W-1}$, the
absolute value of the maximum negative number.) It seems
possible that this suggestion might give better code than what
has been given here in the case that the multiplier $m \ge 2^W$.
Applying it to signed division by 7 gives the following code,
where we have used the relation $-x = \bar{x} + 1$ to avoid a branch:




abs   an,n
li    M,0x92492493     Magic number, (2**34+5)/7.
mulhu q,M,an           q = floor(M*an/2**32).
shri  q,q,2
shrsi t,n,31           These three instructions
xor   q,q,t            negate q if n is
sub   q,q,t            negative.


This is not quite as good as the code we gave for signed 
division by 7 (six versus seven instructions), but it would be 
useful on a machine that has abs and mulhu, but not mulhs. 


The next section gives some representative magic numbers.


10-14 Sample Magic Numbers


TABLE 10-1. SOME MAGIC NUMBERS FOR W = 32

              Signed                 Unsigned
    d       M (hex)     s        M (hex)     a    s
   -5       99999999    1
   -3       55555555    1
 -2**k      7FFFFFFF   k-1
    1          -        -            0       1    0
  2**k      80000001   k-1       2**(32-k)   0    0
    3       55555556    0        AAAAAAAB    0    1
    5       66666667    1        CCCCCCCD    0    2
    6       2AAAAAAB    0        AAAAAAAB    0    2
    7       92492493    2        24924925    1    3
    9       38E38E39    1        38E38E39    0    1
   10       66666667    2        CCCCCCCD    0    3
   11       2E8BA2E9    1        BA2E8BA3    0    3
   12       2AAAAAAB    1        AAAAAAAB    0    3
   25       51EB851F    3        51EB851F    0    3
  125       10624DD3    3        10624DD3    0    3
  625       68DB8BAD    8        D1B71759    0    9


TABLE 10-2. SOME MAGIC NUMBERS FOR W = 64

                 Signed                        Unsigned
    d        M (hex)           s          M (hex)          a    s
   -5    99999999 99999999     1
   -3    55555555 55555555     1
 -2**k   7FFFFFFF FFFFFFFF    k-1
    1            -             -              0            1    0
  2**k   80000000 00000001    k-1         2**(64-k)        0    0
    3    55555555 55555556     0      AAAAAAAA AAAAAAAB    0    1
    5    66666666 66666667     1      CCCCCCCC CCCCCCCD    0    2
    6    2AAAAAAA AAAAAAAB     0      AAAAAAAA AAAAAAAB    0    2
    7    49249249 24924925     1      24924924 92492493    1    3
    9    1C71C71C 71C71C72     0      E38E38E3 8E38E38F    0    3
   10    66666666 66666667     2      CCCCCCCC CCCCCCCD    0    3
   11    2E8BA2E8 BA2E8BA3     1      2E8BA2E8 BA2E8BA3    0    1
   12    2AAAAAAA AAAAAAAB     1      AAAAAAAA AAAAAAAB    0    3
   25    A3D70A3D 70A3D70B     4      47AE147A E147AE15    1    5
  125    20C49BA5 E353F7CF     4      0624DD2F 1A9FBE77    1    7
  625    346DC5D6 3886594B     7      346DC5D6 3886594B    0    7


10-15 Simple Code in Python 


Computing a magic number is greatly simplified if one is not 
limited to doing the calculations in the same word size as that of 
the environment in which the magic number will be used. For 
the unsigned case, for example, in Python it is straightforward to
compute $n_c$ and then evaluate Equations (27) and (26), as
described in Section 10-9. Figure 10-4 shows such a function.




import sys
from math import log

def magicgu(nmax, d):
   # nc is the largest n <= nmax with n mod d == d - 1.
   nc = ((nmax + 1)//d)*d - 1
   nbits = int(log(nmax, 2)) + 1
   for p in range(0, 2*nbits + 1):
      if 2**p > nc*(d - 1 - (2**p - 1)%d):          # Inequality (27).
         m = (2**p + d - 1 - (2**p - 1)%d)//d       # Equation (26).
         return (m, p)
   print("Can't find p, something is wrong.")
   sys.exit(1)


FIGURE 10-4. Python code for computing the magic number
for unsigned division.


The function is given the maximum value of the dividend
nmax and the divisor d. It returns a pair of integers: the magic
number m and a shift amount p. To divide a dividend x by d,
one multiplies x by m and then shifts the (full length) product
right p bits.


This program is more general than the others in this chapter
in two ways: (1) one specifies the maximum value of the
dividend (nmax), rather than the number of bits required for the
dividend, and (2) the program can be used for arbitrarily large
dividends and divisors ("bignums"). The advantage of specifying
the maximum value of the dividend is that one sometimes gets a
smaller magic number than would be obtained if the next power
of two less 1 were used for the maximum value. For example,
suppose the maximum value of the dividend is 89, and the
divisor is 7. Then function magicgu returns (37, 8), meaning that
the magic number is 37 (a 6-bit number) and the shift amount is
8. But if we asked for a magic number that can handle dividends
up to 127, then the result is (147, 10), and 147 is an 8-bit
number.
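
Because the dividend range is small in examples like these, a
candidate pair (m, p) can be checked by brute force. The following C
sketch does that; the function name first_failure is hypothetical,
and the code assumes that m times nmax fits in 64 bits.

```c
#include <stdint.h>

/* Returns the first dividend n in [0, nmax] for which
   (m*n) >> p differs from n/d, or -1 if there is none.
   Assumes m*nmax does not overflow 64 bits and p < 64. */
static int64_t first_failure(uint64_t m, unsigned p, uint64_t d, uint64_t nmax) {
    for (uint64_t n = 0; n <= nmax; n++)
        if (((m * n) >> p) != n / d)
            return (int64_t)n;
    return -1;
}
```

For instance, first_failure(147, 10, 7, 127) returns -1, confirming
that the pair (147, 10) divides every dividend up to 127 by 7.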


10-16 Exact Division by Constants 


By "exact division," we mean division in which it is known
beforehand, somehow, that the remainder is 0. Although this
situation is not common, it does arise, for example, when
subtracting two pointers in the C language. In C, the result of
p - q, where p and q are pointers, is well defined and portable only
if p and q point to objects in the same array [H&S, sec. 7.6.2]. If
the array element size is s, the object code for the difference
p - q computes (p - q) / s.


The material in this section was motivated by [GM, sec. 9]. 


The method to be given applies to both signed and unsigned 
exact division, and is based on the following theorem. 


THEOREM MI. If a and m are relatively prime integers, then there
exists an integer $\bar{a}$, $1 \le \bar{a} < m$, such that

$$a\bar{a} \equiv 1 \pmod{m}.$$

That is, $\bar{a}$ is a multiplicative inverse of a, modulo m. There
are several ways to prove this theorem; three proofs are given in
[NZM, p. 52]. The proof below requires only a very basic
familiarity with congruences.


Proof. We will prove something a little more general than the
theorem. If a and m are relatively prime (therefore nonzero),
then as x ranges over all m distinct values modulo m, ax takes on
all m distinct values modulo m. For example, if a = 3 and m =
8, then as x ranges from 0 to 7, ax = 0, 3, 6, 9, 12, 15, 18, 21
or, reduced modulo 8, ax = 0, 3, 6, 1, 4, 7, 2, 5. Observe that all
values from 0 to 7 are present in the last sequence.


To see this in general, assume that it is not true. Then there
exist distinct integers that map to the same value when
multiplied by a; that is, there exist x and y, with
$x \not\equiv y \pmod{m}$, such that

$$ax \equiv ay \pmod{m}.$$

Then there exists an integer k such that

$$ax - ay = km, \quad\text{or}\quad a(x - y) = km.$$

Because a has no factor in common with m, it must be that x - y
is a multiple of m; that is,

$$x \equiv y \pmod{m}.$$

This contradicts the hypothesis.


Now, because ax takes on all m distinct values modulo m, as
x ranges over the m values, it must take on the value 1 for some
x.


The proof shows that there is only one value (modulo m) of x
such that $ax \equiv 1 \pmod{m}$; that is, the multiplicative inverse is
unique, apart from additive multiples of m. It also shows that
there is a unique (modulo m) integer x such that
$ax \equiv b \pmod{m}$, where b is any integer.


As an example, consider the case m = 16. Then $\bar{3} = 11$,
because $3 \cdot 11 = 33 \equiv 1 \pmod{16}$. We could just as well take
$\bar{3} = -5$, because $3 \cdot (-5) = -15 \equiv 1 \pmod{16}$. Similarly,
$\overline{-3} = 5$, because $(-3) \cdot 5 = -15 \equiv 1 \pmod{16}$.


These observations are important because they show that the
concepts apply to both signed and unsigned numbers. If we are
working in the domain of unsigned integers on a 4-bit machine,
we take $\bar{3} = 11$. In the domain of signed integers, we take
$\bar{3} = -5$. But 11 and -5 have the same representation in
two's-complement (because they differ by 16), so the same computer
word contents can serve in both domains as the multiplicative
inverse.


The theorem applies directly to the problem of division
(signed and unsigned) by an odd integer d on a W-bit computer.
Because any odd integer is relatively prime to $2^W$, the theorem
says that if d is odd, there exists an integer $\bar{d}$ (unique in the
range 0 to $2^W - 1$ or in the range $-2^{W-1}$ to $2^{W-1} - 1$) such that

$$d\bar{d} \equiv 1 \pmod{2^W}.$$

Hence, for any integer n that is a multiple of d,

$$\frac{n}{d} \equiv \frac{n}{d}(d\bar{d}) \equiv n\bar{d} \pmod{2^W}.$$

In other words, n/d can be calculated by multiplying n by $\bar{d}$ and
retaining only the rightmost W bits of the product.

If the divisor d is even, let $d = d_o \cdot 2^k$, where $d_o$ is odd and
$k \ge 1$. Then, simply shift n right k positions (shifting out 0's), and
then multiply by $\bar{d}_o$ (the shift could be done after the
multiplication as well).

Below is the code for division of n by 7, where n is a multiple 
of 7. This code gives the correct result whether it is considered 
to be signed or unsigned division. 




li    M,0xB6DB6DB7     Mult. inverse, (5*2**32 + 1)/7.
mul   q,M,n            q = n/7.
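
The same computation can be written in C, where unsigned
multiplication already discards all but the low-order 32 bits. This is
a minimal sketch assuming 32-bit words; the function names are
hypothetical, and the even-divisor variant shown is for unsigned
dividends only.

```c
#include <stdint.h>

/* Exact division by 7: multiply by the inverse of 7 modulo 2**32.
   The result is meaningful only when n is a multiple of 7. */
static uint32_t exact_div7(uint32_t n) {
    return n * 0xB6DB6DB7u;
}

/* Exact unsigned division by 28 = 7 * 2**2: shift out the two
   trailing 0-bits, then multiply by the inverse of the odd part. */
static uint32_t exact_div28(uint32_t n) {
    return (n >> 2) * 0xB6DB6DB7u;
}
```

For a signed dividend such as -49, reinterpreting the word as
unsigned, calling exact_div7, and reinterpreting the result as signed
gives -7, because the bit pattern is the same in both domains.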


Computing the Multiplicative Inverse by the Euclidean 
Algorithm 


How can we compute the multiplicative inverse? The standard
method is by means of the "extended Euclidean algorithm." This
is briefly discussed below as it applies to our problem, and the
interested reader is referred to [NZM, p. 13] and to [Knu2,
4.5.2] for a more complete discussion.

Given an odd divisor d, we wish to solve for x

$$dx \equiv 1 \pmod{m},$$

where, in our application, $m = 2^W$ and W is the word size of the
machine. This will be accomplished if we can solve for integers x
and y (positive, negative, or 0) the equation

$$dx + my = 1.$$


Toward this end, first make d positive by adding a sufficient
number of multiples of m to it. (d and d + km have the same
multiplicative inverse.) Second, write the following equations (in
which d, m > 0):

$$d(-1) + m(1) = m - d, \tag{i}$$
$$d(1) + m(0) = d. \tag{ii}$$

If d = 1, we are done, because (ii) shows that x = 1. Otherwise,
compute

$$q = \left\lfloor \frac{m - d}{d} \right\rfloor.$$

Third, multiply Equation (ii) by q and subtract it from (i). This
gives

$$d(-1 - q) + m(1) = m - d - qd = \mathrm{rem}(m - d, d).$$

This equation holds because we have simply multiplied one
equation by a constant and subtracted it from another. If
rem(m - d, d) = 1, we are done; this last equation is the solution
and x = -1 - q.

Repeat this process on the last two equations, obtaining a
fourth, and continue until the right-hand side of the equation is
1. The multiplier of d, reduced modulo m, is then the desired
inverse of d.


Incidentally, if m - d < d, so that the first quotient is 0, then
the third row will be a copy of the first, so that the second
quotient will be nonzero. Furthermore, most texts start with the
first row being

$$d(0) + m(1) = m,$$

but in our application $m = 2^W$ is not representable in the
machine.


The process is best illustrated by an example: Let m = 256
and d = 7. Then the calculation proceeds as follows. To get the
third row, note that $q = \lfloor 249/7 \rfloor = 35$.




7(-1) + 256( 1) = 249 
7( 1) + 256( 0) = 7 
7(-36) + 256( 1) = 4 
7( 37) + 256(-1) = 3 
7(-73) + 256( 2) = 1 


Thus, the multiplicative inverse of 7, modulo 256, is -73 or,
expressed in the range 0 to 255, is 183. Check: $7 \cdot 183 = 1281
\equiv 1 \pmod{256}$.


From the third row on, the integers in the right-hand column
are all remainders of dividing the number above it into the
number two rows above it, so they form a sequence of strictly
decreasing nonnegative integers. Therefore, the sequence must
end in 0 (as the above would if carried one more step).
Furthermore, the value just before the 0 must be 1, for the
following reason. Suppose the sequence ends in b followed by 0,
with $b \ne 1$. Then, the integer preceding the b must be a multiple
of b, let's say $k_1 b$, for the next remainder to be 0. The integer
preceding $k_1 b$ must be of the form $k_1 k_2 b + b$, for the next
remainder to be b. Continuing up the sequence, every number
must be a multiple of b, including the first two (in the positions
of the 249 and the 7 in the above example). This is impossible,
because the first two integers are m - d and d, which are
relatively prime.


This constitutes an informal proof that the above process 
terminates, with a value of 1 in the right-hand column, and 
hence it finds the multiplicative inverse of d. 

To carry this out on a computer, first note that if d < 0, we
should add $2^W$ to it. With two's-complement arithmetic it is not
necessary to actually do anything here; simply interpret d as an
unsigned number, regardless of how the application interprets it.
The computation of q must use unsigned division.


Observe that the calculations can be done modulo m, because
this does not change the right-hand column (these values are in
the range 0 to m - 1 anyway). This is important, because it
enables the calculations to be done in "single precision," using
the computer's modulo-$2^W$ unsigned arithmetic.


Most of the quantities in the table need not be represented.
The column of multiples of 256 need not be represented,
because in solving dx + my = 1, we do not need the value of y.
There is no need to represent d in the first column. Reduced to
its bare essentials, then, the calculation of the above example is
carried out as follows:

255   249
  1     7
220     4
 37     3
183     1


A C program for performing this computation is shown in 
Figure 10-5. 




unsigned mulinv(unsigned d) {      // d must be odd.
   unsigned x1, v1, x2, v2, x3, v3, q;

   x1 = 0xFFFFFFFF;    v1 = -d;
   x2 = 1;             v2 = d;
   while (v2 > 1) {
      q = v1/v2;
      x3 = x1 - q*x2;  v3 = v1 - q*v2;
      x1 = x2;         v1 = v2;
      x2 = x3;         v2 = v3;
   }
   return x2;
}


FIGURE 10-5. Multiplicative inverse modulo 2**32 by the
Euclidean algorithm.


The reason the loop continuation condition is (v2 > 1),
rather than the more natural (v2 != 1), is that if the latter
condition were used, the loop would never terminate if the
program were invoked with an even argument. It is best that
programs not loop forever even if misused. (If the argument d is
even, v2 never takes on the value 1, but it does become 0.)


What does the program compute if given an even argument?
As written, it computes a number x such that $dx \equiv 0 \pmod{2^{32}}$,
which is probably not useful. However, with the minor
modification of changing the loop continuation condition to
(v2 != 0) and returning x1 rather than x2, it computes a number x
such that $dx \equiv g \pmod{2^{32}}$, where g is the greatest common
divisor of d and $2^{32}$; that is, the greatest power of 2 that divides
d. The modified program still computes the multiplicative
inverse of d for d odd, but it requires one more iteration than the
unmodified program.
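
The modification just described might look as follows. This is a
sketch of the two stated changes only (loop condition and return
value), assuming 32-bit unsigned arithmetic and a nonzero argument;
the function name mulinv_gcd is hypothetical.

```c
#include <stdint.h>

/* Returns x such that d*x = g (mod 2**32), where g is the greatest
   power of 2 dividing d. For odd d this is the multiplicative inverse.
   Assumes d != 0. */
static uint32_t mulinv_gcd(uint32_t d) {
    uint32_t x1, v1, x2, v2, x3, v3, q;

    x1 = 0xFFFFFFFFu;  v1 = 0u - d;    /* row (i):  d*(-1) = m - d */
    x2 = 1;            v2 = d;         /* row (ii): d*( 1) = d     */
    while (v2 != 0) {                  /* changed from (v2 > 1)    */
        q = v1 / v2;
        x3 = x1 - q * x2;  v3 = v1 - q * v2;
        x1 = x2;           v1 = v2;
        x2 = x3;           v2 = v3;
    }
    return x1;                         /* changed from x2          */
}
```

Throughout the loop each pair satisfies d*x = v modulo 2**32, so
when v2 reaches 0 the pair (x1, v1) holds the multiplier for the last
nonzero remainder, which is the greatest common divisor.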


As for the number of iterations (divisions) required by the 
above program, for d odd and less than 20, it requires a 
maximum of 3 and an average of 1.7. For d in the neighborhood 
of 1000, it requires a maximum of 11 and an average of about 6. 


Computing the Multiplicative Inverse by Newton's Method 


It is well known that, over the real numbers, 1/d, for $d \ne 0$,
can be calculated to ever-increasing accuracy by iteratively
evaluating

$$x_{n+1} = x_n(2 - d x_n), \tag{31}$$

provided the initial estimate $x_0$ is sufficiently close to 1/d. The
number of digits of accuracy approximately doubles with each
iteration.


It is not so well known that this same formula can be used to
find the multiplicative inverse modulo any power of 2. For
example, to find the multiplicative inverse of 3, modulo 256,
start with $x_0 = 1$ (any odd number will do). Then,

$$x_1 = 1(2 - 3 \cdot 1) = -1,$$
$$x_2 = -1(2 - 3(-1)) = -5,$$
$$x_3 = -5(2 - 3(-5)) = -85,$$
$$x_4 = -85(2 - 3(-85)) = -21845 \equiv -85 \pmod{256}.$$

The iteration has reached a fixed point modulo 256, so -85, or
171, is the multiplicative inverse of 3 (modulo 256). All
calculations can be done modulo 256.


Why does this work? Because if $x_n$ satisfies

$$d x_n \equiv 1 \pmod{m}$$

and if $x_{n+1}$ is defined by (31), then

$$d x_{n+1} \equiv 1 \pmod{m^2}.$$

To see this, let $d x_n = 1 + km$. Then

$$\begin{aligned}
d x_{n+1} &= d x_n (2 - d x_n) \\
          &= (1 + km)(2 - (1 + km)) \\
          &= (1 + km)(1 - km) \\
          &= 1 - k^2 m^2 \\
          &\equiv 1 \pmod{m^2}.
\end{aligned}$$

In our application, m is a power of 2, say $2^N$. In this case, if

$$d x_n \equiv 1 \pmod{2^N}, \quad\text{then}\quad d x_{n+1} \equiv 1 \pmod{2^{2N}}.$$

In a sense, if $x_n$ is regarded as a sort of approximation to $\bar{d}$, then
each iteration of (31) doubles the number of bits of "accuracy"
of the approximation.

It happens that modulo 8, the multiplicative inverse of any
(odd) number d is d itself. Thus, taking $x_0 = d$ is a reasonable
and simple initial guess at $\bar{d}$. Then, (31) will give values of
$x_1, x_2, \ldots$, such that

$$d x_1 \equiv 1 \pmod{2^6},$$
$$d x_2 \equiv 1 \pmod{2^{12}},$$
$$d x_3 \equiv 1 \pmod{2^{24}},$$
$$d x_4 \equiv 1 \pmod{2^{48}}, \text{ and so on.}$$

Thus, four iterations suffice to find the multiplicative inverse
modulo $2^{32}$ (if $x \equiv 1 \pmod{2^{48}}$, then $x \equiv 1 \pmod{2^n}$ for
$n \le 48$). This leads to the C program in Figure 10-6, in which all
computations are done modulo $2^{32}$.


For about half the values of d, this program takes 4.5
iterations, or nine multiplications. For the other half (those for
which the initial value of xn is "correct to 4 bits," that is,
$d^2 \equiv 1 \pmod{16}$), it takes seven or fewer, usually seven,
multiplications. Thus, it takes about eight multiplications on
average.




unsigned mulinv(unsigned d) {      // d must be odd.
   unsigned xn, t;

   xn = d;
loop: t = d*xn;
   if (t == 1) return xn;
   xn = xn*(2 - t);
   goto loop;
}


FIGURE 10-6. Multiplicative inverse modulo 2**32 by Newton's
method.


A variation is to simply execute the loop four times,
regardless of d, perhaps "strung out" to eliminate the loop
control (eight multiplications). Another variation is to somehow
make the initial estimate $x_0$ "correct to 4 bits" (that is, find $x_0$
that satisfies $d x_0 \equiv 1 \pmod{16}$). Then, only three loop iterations
are required. Some ways to set the initial estimate are

$$x_0 \leftarrow d + 2((d + 1) \,\&\, 4), \text{ and}$$

$$x_0 \leftarrow d^2 + d - 1.$$

Here, the multiplication by 2 is a left shift, and the computations
are done modulo $2^{32}$ (ignoring overflow). Because the second
formula uses a multiplication, it saves only one.
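
A "strung out" version with a 4-bit-correct initial estimate might
look like the following sketch. It assumes 32-bit unsigned arithmetic
and uses the estimate x0 = d + 2((d + 1) & 4), which satisfies
d*x0 = 1 (mod 16) for every odd d; the function name mulinv_newton3
is hypothetical.

```c
#include <stdint.h>

/* Multiplicative inverse of odd d modulo 2**32: a 4-bit-correct
   initial estimate followed by three unrolled Newton steps. */
static uint32_t mulinv_newton3(uint32_t d) {
    uint32_t x = d + 2 * ((d + 1) & 4);   /* d*x = 1 (mod 2**4)  */
    x = x * (2 - d * x);                  /* d*x = 1 (mod 2**8)  */
    x = x * (2 - d * x);                  /* d*x = 1 (mod 2**16) */
    x = x * (2 - d * x);                  /* d*x = 1 (mod 2**32) */
    return x;
}
```

This uses six multiplications and has no loop or test, at the cost of
always doing the full three steps.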


This concern about execution time is, of course, totally 
unimportant for the compiler application. For that application, 
the routine would be so seldom used that it should be coded for 
minimum space. But there may be applications in which it is 
desirable to compute the multiplicative inverse quickly. 


The "Newton method" described here applies only when (1)
the modulus is an integral power of some number a, and (2) the
multiplicative inverse of d modulo a is known. It works
particularly well for a = 2, because then the multiplicative
inverse of any (odd) number d modulo 2 is known immediately:
it is 1.


Sample Multiplicative Inverses 


We conclude this section with a listing of some multiplicative 
inverses in Table 10-3. 


TABLE 10-3. SAMPLE MULTIPLICATIVE INVERSES

                          Inverse of d
   d      mod 16     mod 2**32        mod 2**64
 (dec)    (dec)        (hex)            (hex)
   -7       9        49249249     92492492 49249249
   -5       3        33333333     33333333 33333333
   -3       5        55555555     55555555 55555555
   -1      15        FFFFFFFF     FFFFFFFF FFFFFFFF
    1       1               1                     1
    3      11        AAAAAAAB     AAAAAAAA AAAAAAAB
    5      13        CCCCCCCD     CCCCCCCC CCCCCCCD
    7       7        B6DB6DB7     6DB6DB6D B6DB6DB7
    9       9        38E38E39     8E38E38E 38E38E39
   11       3        BA2E8BA3     2E8BA2E8 BA2E8BA3
   13       5        C4EC4EC5     4EC4EC4E C4EC4EC5
   15      15        EEEEEEEF     EEEEEEEE EEEEEEEF
   25       9        C28F5C29     8F5C28F5 C28F5C29
  125       5        26E978D5     1CAC0831 26E978D5
  625       1        3AFB7E91     D288CE70 3AFB7E91


You may notice that in several cases (d = 3, 5, 9, 11), the
multiplicative inverse of d is the same as the magic number for
unsigned division by d (see Section 10-14, "Sample Magic
Numbers," on page 238). This is more or less a coincidence. It
happens that for these numbers, the magic number M is equal to
the multiplier m, and these are of the form $(2^p + 1)/d$, with
$p \ge 32$. In this case, notice that

$$Md = \left(\frac{2^p + 1}{d}\right) d \equiv 1 \pmod{2^{32}},$$

so that $M \equiv \bar{d} \pmod{2^{32}}$.


10-17 Test for Zero Remainder after Division by a Constant


The multiplicative inverse of a divisor d can be used to test for a
zero remainder after division by d [GM].


Unsigned 


First, consider unsigned division with the divisor d odd. Denote
by $\bar{d}$ the multiplicative inverse of d. Then, because
$d\bar{d} \equiv 1 \pmod{2^W}$, where W is the machine's word size in bits,
$\bar{d}$ is also odd. Thus, $\bar{d}$ is relatively prime to $2^W$, and as
shown in the proof of theorem MI in the preceding section, as n ranges
over all $2^W$ distinct values modulo $2^W$, $n\bar{d}$ takes on all $2^W$
distinct values modulo $2^W$.


It was shown in the preceding section that if n is a multiple of
d,

$$\frac{n}{d} \equiv n\bar{d} \pmod{2^W}.$$

That is, for $n = 0, d, 2d, \ldots, \lfloor (2^W - 1)/d \rfloor d$,
$n\bar{d} \equiv 0, 1, 2, \ldots, \lfloor (2^W - 1)/d \rfloor \pmod{2^W}$.
Therefore, for n not a multiple of d, the value of $n\bar{d}$, reduced
modulo $2^W$ to the range 0 to $2^W - 1$, must exceed
$\lfloor (2^W - 1)/d \rfloor$.


This can be used to test for a zero remainder. For example, to 
test if an integer n is a multiple of 25, multiply n by 
$\overline{25}$ and compare the rightmost W bits to 
$\lfloor (2^W - 1)/25 \rfloor$. On our basic RISC: 


   li     M,0xC28F5C29    Load mult. inverse of 25. 
   mul    q,M,n           q = right half of M*n. 
   li     c,0x0A3D70A3    c = floor((2**32-1)/25). 
   cmpleu t,q,c           Compare q and c, and branch 
   bt     t,is_mult       if n is a multiple of 25. 
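

The same test can be sketched in C, assuming 32-bit unsigned arithmetic that wraps modulo 2**32. The function name is illustrative and is not part of the original text. 

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true when n is a multiple of 25 (32-bit unsigned). */
static bool is_multiple_of_25(uint32_t n) {
    const uint32_t inv25 = 0xC28F5C29u;        /* inverse of 25 mod 2**32 */
    const uint32_t limit = 0xFFFFFFFFu / 25u;  /* floor((2**32 - 1)/25)   */
    uint32_t q = n * inv25;                    /* low 32 bits of product  */
    return q <= limit;                         /* one unsigned compare    */
}
```

For n = 50 the product is 2, which is below the limit; for n = 1 the product is the inverse itself, which is far above it. 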


To extend this to even divisors, let $d = d_o \cdot 2^k$, where 
$d_o$ is odd and $k \ge 1$. Then, because an integer is divisible by 
d if and only if it is divisible by $d_o$ and by $2^k$, and because 
n and $n\bar{d}_o$ have the same number of trailing zeros 
($\bar{d}_o$ is odd), the test that n is a multiple of d is 


   Set $q = \mathrm{mod}(n\bar{d}_o, 2^W)$; 
   $q \le \lfloor (2^W - 1)/d_o \rfloor$ and q ends in k or more 0-bits, 


where the mod function is understood to reduce $n\bar{d}_o$ to the 
interval $[0, 2^W - 1]$. 


Direct implementation of this requires two tests and 
conditional branches, but it can be reduced to one compare- 
branch quite efficiently if the machine has the rotate-shift 
instruction. This follows from the following theorem, in which 
$a \overset{\mathrm{rot}}{\gg} k$ denotes the computer word a 
rotated right k positions ($0 \le k \le 32$). 


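THEOREM ZRU. $x \overset{u}{\le} a$ and x ends in k 0-bits if and 
only if $x \overset{\mathrm{rot}}{\gg} k \overset{u}{\le} 
\lfloor a/2^k \rfloor$. 

Proof. (Assume a 32-bit machine.) Suppose $x \le a$ and x ends 
in k 0-bits. Then, because $x \le a$, $\lfloor x/2^k \rfloor \le 
\lfloor a/2^k \rfloor$. But $\lfloor x/2^k \rfloor = 
x \overset{\mathrm{rot}}{\gg} k$. Therefore, 
$x \overset{\mathrm{rot}}{\gg} k \le \lfloor a/2^k \rfloor$. If x 
does not end in k 0-bits, then $x \overset{\mathrm{rot}}{\gg} k$ 
does not begin with k 0-bits, whereas $\lfloor a/2^k \rfloor$ does, 
so $x \overset{\mathrm{rot}}{\gg} k > \lfloor a/2^k \rfloor$. 
Lastly, if $x > a$ and x ends in k 0-bits, then the integer formed 
from the first 32 - k bits of x must exceed that formed from the 
first 32 - k bits of a, so that $\lfloor x/2^k \rfloor > 
\lfloor a/2^k \rfloor$. 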


Using this theorem, the test that n is a multiple of d, where n 
and $d \ge 1$ are unsigned integers and $d = d_o \cdot 2^k$ with 
$d_o$ odd, is 


   $q \leftarrow \mathrm{mod}(n\bar{d}_o, 2^W)$; 
   $q \overset{\mathrm{rot}}{\gg} k \overset{u}{\le} \lfloor (2^W - 1)/d \rfloor$. 


Here we used $\lfloor \lfloor (2^W - 1)/d_o \rfloor / 2^k \rfloor = 
\lfloor (2^W - 1)/(d_o \cdot 2^k) \rfloor = 
\lfloor (2^W - 1)/d \rfloor$. 


As an example, the following code tests an unsigned integer n 
to see if it is a multiple of 100: 


   li     M,0xC28F5C29    Load mult. inverse of 25. 
   mul    q,M,n           q = right half of M*n. 
   shrri  q,q,2           Rotate right two positions. 
   li     c,0x028F5C28    c = floor((2**32-1)/100). 
   cmpleu t,q,c           Compare q and c, and branch 
   bt     t,is_mult       if n is a multiple of 100. 
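

A C sketch of the same rotate-and-compare test follows. C has no rotate operator, so the helper `rotr32` (a name chosen here for illustration) builds one from two shifts; many compilers turn this idiom into a single rotate instruction, though that is not guaranteed. 

```c
#include <stdint.h>
#include <stdbool.h>

/* Rotate a 32-bit word right by k positions, for 0 < k < 32. */
static uint32_t rotr32(uint32_t x, unsigned k) {
    return (x >> k) | (x << (32u - k));
}

/* True when n is a multiple of 100 = 25 * 2**2. */
static bool is_multiple_of_100(uint32_t n) {
    uint32_t q = n * 0xC28F5C29u;               /* n times inverse of 25 */
    return rotr32(q, 2) <= 0xFFFFFFFFu / 100u;  /* single unsigned compare */
}
```

For n = 50 the product is 2, whose rotation by two positions is 0x80000000, so the single comparison rejects it even though 50 is a multiple of 25. 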


Signed, Divisor ≥ 2 


For signed division, it was shown in the preceding section that if 
n is a multiple of d and d is odd, then 


$$\frac{n}{d} \equiv n\bar{d} \pmod{2^W}.$$ 


Thus, for $n = \lceil -2^{W-1}/d \rceil \cdot d, \ldots, -d, 0, d, 
\ldots, \lfloor (2^{W-1} - 1)/d \rfloor \cdot d$, we have 
$n\bar{d} \equiv \lceil -2^{W-1}/d \rceil, \ldots, -1, 0, 1, \ldots, 
\lfloor (2^{W-1} - 1)/d \rfloor \pmod{2^W}$. Furthermore, because d 
is relatively prime to $2^W$, as n ranges over all $2^W$ distinct 
values modulo $2^W$, $n\bar{d}$ takes on all $2^W$ distinct values 
modulo $2^W$. Therefore, n is a multiple of d if and only if 


$$\lceil -2^{W-1}/d \rceil \le \mathrm{mod}(n\bar{d}, 2^W) \le 
\lfloor (2^{W-1} - 1)/d \rfloor,$$ 


where the mod function is understood to reduce $n\bar{d}$ to the 
interval $[-2^{W-1}, 2^{W-1} - 1]$. 


This can be simplified a little by observing that because d is 
odd and, as we are assuming, positive and not equal to 1, it does 
not divide $2^{W-1}$. Therefore, 


$$\lceil -2^{W-1}/d \rceil = \lceil (-2^{W-1} + 1)/d \rceil = 
-\lfloor (2^{W-1} - 1)/d \rfloor.$$ 


Thus, for signed numbers, the test that n is a multiple of d, 
where $d = d_o \cdot 2^k$ and $d_o$ is odd, is 


   Set $q = \mathrm{mod}(n\bar{d}_o, 2^W)$; 
   $-\lfloor (2^{W-1} - 1)/d_o \rfloor \le q \le 
   \lfloor (2^{W-1} - 1)/d_o \rfloor$ and q ends in k or more 0-bits. 


On the surface, this would seem to require three tests and 
branches. However, as in the unsigned case, it can be reduced to 
one compare-branch by use of the following theorem: 


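THEOREM ZRS. If $a \ge 0$, the following assertions are equivalent: 

   (1) $-a \le x \le a$ and x ends in k or more 0-bits, 
   (2) $\mathrm{abs}(x) \overset{\mathrm{rot}}{\gg} k \overset{u}{\le} 
       \lfloor a/2^k \rfloor$, and 
   (3) $x + a' \overset{\mathrm{rot}}{\gg} k \overset{u}{\le} 
       \lfloor 2a'/2^k \rfloor$, 

where a' is a with its rightmost k bits set to 0 (that is, 
$a' = a \,\&\, -2^k$). 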


Proof. (Assume a 32-bit machine.) To see that (1) is 
equivalent to (2), clearly the assertion $-a \le x \le a$ is 
equivalent to $\mathrm{abs}(x) \le a$. Then, Theorem ZRU applies, 
because both sides of this inequality are nonnegative. 


To see that (1) is equivalent to (3), note that assertion (1) is 
equivalent to itself with a replaced with a'. Then, by the 
theorem on bounds checking on page 68, this in turn is 
equivalent to 


$$x + a' \overset{u}{\le} 2a'.$$ 


Because x + a' ends in k 0-bits if and only if x does, Theorem 
ZRU applies, giving the result. 


Using part (3) of this theorem, the test that n is a multiple of 
d, where n and $d \ge 2$ are signed integers and $d = d_o \cdot 2^k$ 
with $d_o$ odd, is 


   $q \leftarrow \mathrm{mod}(n\bar{d}_o, 2^W)$; 
   $a' \leftarrow \lfloor (2^{W-1} - 1)/d_o \rfloor \,\&\, -2^k$; 
   $q + a' \overset{\mathrm{rot}}{\gg} k \overset{u}{\le} 
   \lfloor (2a')/2^k \rfloor$. 


(a' can be computed at compile time, because d is a constant.) 


As an example, the following code tests a signed integer n to 
see if it is a multiple of 100. Notice that the constant 
$\lfloor 2a'/2^k \rfloor$ can always be derived from the constant a' 
by a shift of k - 1 bits, saving an instruction or a load from 
memory to develop the comparand. 


   li     M,0xC28F5C29    Load mult. inverse of 25. 
   mul    q,M,n           q = right half of M*n. 
   li     c,0x051EB850    c = floor((2**31 - 1)/25) & -4. 
   add    q,q,c           Add c. 
   shrri  q,q,2           Rotate right two positions. 
   shri   c,c,1           Compute const. for comparison. 
   cmpleu t,q,c           Compare q and c, and 
   bt     t,is_mult       branch if n is a mult. of 100. 
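

The signed test can be sketched in C along the same lines. The sketch works on the unsigned bit pattern of n so that all arithmetic wraps modulo 2**32; the helper and function names are illustrative assumptions, not names from the text. 

```c
#include <stdint.h>
#include <stdbool.h>

/* Rotate a 32-bit word right by k positions, for 0 < k < 32. */
static uint32_t rotr32(uint32_t x, unsigned k) {
    return (x >> k) | (x << (32u - k));
}

/* True when the signed value n is a multiple of 100 = 25 * 2**2. */
static bool is_multiple_of_100_signed(int32_t n) {
    const uint32_t a1 = (0x7FFFFFFFu / 25u) & ~3u;  /* a' = 0x051EB850      */
    uint32_t q = (uint32_t)n * 0xC28F5C29u;         /* n times inverse of 25 */
    return rotr32(q + a1, 2) <= (a1 >> 1);          /* floor(2a'/2**2)       */
}
```

Adding a' shifts the symmetric range -a' to a' onto 0 to 2a', which is what allows one unsigned comparison to stand in for the two signed ones. 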


10-18 Methods Not Using Multiply High 


In this section we consider some methods for dividing by 
constants that do not use the multiply high instruction, or a 
multiplication instruction that gives a double-word result. We 
show how to change division by a constant into a sequence of 
shift and add instructions, or shift, add, and multiply for more 
compact code. 


Unsigned Division 


For these methods, unsigned division is simpler than signed 
division, so we deal with unsigned division first. One method is 
to use the techniques given that use the multiply high instruction, 
but use the code shown in Figure 8-2 on page 174 to do the 
multiply high operation. Figure 10-7 shows how this works out 
for the case of (unsigned) division by 3. This is a combination of 
the code on page 228 and Figure 8-2 with “int” changed to 
"unsigned." The code is 15 instructions, including four 
multiplications. The multiplications are by large constants and 
would take quite a few instructions if converted to shift's and 
add's. Very similar code can be devised for the signed case. This 
method is not particularly good and won't be discussed further. 


Another method [GLS1] is to compute in advance the 
reciprocal of the divisor, and multiply the dividend by that with 
a series of shift right and add instructions. This gives an 
approximation to the quotient. It is merely an approximation, 
because the reciprocal of the divisor (which we assume is not an 
exact power of two) is not expressed exactly in 32 bits, and also 
because each shift right discards bits of the dividend. Next, the 
remainder with respect to the approximate quotient is 
computed, and that is divided by the divisor to form a 
correction, which is added to the approximate quotient, giving 
the exact quotient. The remainder is generally small compared 
to the divisor (a few multiples thereof), so there is often a simple 
way to compute the correction without using a divide 
instruction. 


To illustrate this method, consider dividing by 3, that is, 
computing $\lfloor n/3 \rfloor$ where $0 \le n < 2^{32}$. The 
reciprocal of 3, in binary, is approximately 


0.0101 0101 0101 0101 0101 0101 0101 0101. 


To compute the approximate product of that and n, we could use 


$$q \leftarrow (n \gg 2) + (n \gg 4) + (n \gg 6) + \cdots + (n \gg 30) \tag{32}$$ 


(29 instructions; the last 1 in the reciprocal is ignored because it 
would add the term $n \gg 32$, which is obviously 0). However, 
the simple repeating pattern of 1's and 0's in the reciprocal 
permits a method that is both faster (nine instructions) and more 
accurate: 


$$\begin{aligned} 
q &\leftarrow (n \gg 2) + (n \gg 4) \\ 
q &\leftarrow q + (q \gg 4) \\ 
q &\leftarrow q + (q \gg 8) \\ 
q &\leftarrow q + (q \gg 16) 
\end{aligned} \tag{1}$$ 


unsigned divu3(unsigned n) { 
   unsigned n0, n1, w0, w1, w2, t, q; 

   n0 = n & 0xFFFF; 
   n1 = n >> 16; 
   w0 = n0*0xAAAB; 
   t = n1*0xAAAB + (w0 >> 16); 
   w1 = t & 0xFFFF; 
   w2 = t >> 16; 
   w1 = n0*0xAAAA + w1; 
   q = n1*0xAAAA + w2 + (w1 >> 16); 
   return q >> 1; 
} 


FIGURE 10-7. Unsigned divide by 3 using simulated multiply 
high unsigned. 


To compare these methods for their accuracy, consider the 
bits that are shifted out by each term of (32), if n is all 1-bits. 
The first term shifts out two 1-bits, the next four 1-bits, and so 
on. Each of these contributes an error of almost 1 in the least 
significant bit. Since there are 16 terms (counting the term we 
ignored), the shifts contribute an error of almost 16. There is an 
additional error due to the fact that the reciprocal is truncated to 
32 bits; it turns out that the maximum total error is 16. 


For procedure (1), each right shift also contributes an error of 


almost 1 in the least significant bit. But there are only five shift 
operations. They contribute an error of almost 5, and there is a 
further error due to the fact that the reciprocal is truncated to 32 
bits; it turns out that the maximum total error is 5. 
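

As a worked check of this error bound, the following sketch evaluates procedure (1) in C; the function name is an illustrative choice. 

```c
#include <stdint.h>

/* Estimate floor(n/3) with procedure (1); the result may be too low. */
static uint32_t estimate_div3(uint32_t n) {
    uint32_t q = (n >> 2) + (n >> 4);   /* n * 0.0101 (binary), truncated */
    q = q + (q >> 4);                   /* n * 0.01010101                 */
    q = q + (q >> 8);
    q = q + (q >> 16);
    return q;
}
```

For n = 0xFFFFFFFF the intermediate values are 0x4FFFFFFE, 0x54FFFFFD, 0x5554FFFC, and finally 0x55555550, which is 5 below the exact quotient 0x55555555. 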


After computing the estimated quotient q, the remainder r is 
computed from 


$$r \leftarrow n - q * 3.$$ 


The remainder cannot be negative, because q is never larger 
than the exact quotient. We need to know how large r can be to 
devise the simplest possible method for computing 
$r \overset{u}{\div} 3$. In general, for a divisor d and an 
estimated quotient q too low by k, the remainder will range from 
k*d to k*d + d - 1. (The upper limit is conservative; it may not 
actually be attained.) Thus, using (1), for which q is too low by 
at most 5, we expect the remainder to be at most 5*3 + 2 = 17. 
Experimentation reveals that it is actually at most 15. Thus, for 
the correction we must compute (exactly) 


$$r \overset{u}{\div} 3, \quad \text{for } 0 \le r \le 15.$$ 


Since r is small compared to the largest value that a register 
can hold, this can be approximated by multiplying r by some 
approximation to 1/3 of the form a/b where b is a power of 2. 
This is easy to compute, because the division is simply a shift. 
The value of a/ b must be slightly larger than 1/3, so that after 
shifting the result will agree with truncated division. A sequence 
of such approximations is: 


1/2, 2/4, 3/8, 6/16, 11/32, 22/64, 43/128, 86/256, 171/512, 
342/1024, .... 


Usually, the smaller fractions in the sequence are easier to 
compute, so we choose the smallest one that works; in the case 
at hand this is 11/32. Therefore, the final, exact, quotient is 
given by 


$$q \leftarrow q + (11 * r \gg 5).$$ 


The solution involves two multiplications by small numbers 
(3 and 11); these can be changed to shift's and add's. 
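

The choice of 11/32 can be checked by brute force over the small range of r. The helper below is a hypothetical sketch (its name and parameters are not from the text): it reports whether multiplying by a and shifting right by s agrees with truncated division by d for every r up to rmax. 

```c
#include <stdbool.h>

/* True when (a*r) >> s equals r / d for every r in 0..rmax. */
static bool fraction_works(unsigned a, unsigned s, unsigned d, unsigned rmax) {
    for (unsigned r = 0; r <= rmax; r++) {
        if (((a * r) >> s) != r / d)
            return false;               /* first disagreement ends the search */
    }
    return true;
}
```

With d = 3 and rmax = 15, the fractions 1/2, 2/4, 3/8, and 6/16 all fail (for example, 3/8 gives 3 at r = 8, where the quotient is 2), while 11/32 agrees for every r in the range. 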


Figure 10-8 shows the entire solution in C. As shown, it 
consists of 14 instructions, including two multiplications. If the 
multiplications are changed to shift's and add's, it amounts to 18 
elementary instructions. However, if it is desired to avoid the 
multiplications, then either alternative return statement shown 
gives a solution in 17 elementary instructions. Alternative 2 has 
just a little instruction-level parallelism, but in truth this method 
generally has very little of that. 


A more accurate estimate of the quotient can be obtained by 
changing the first executable line to 


   q = (n >> 1) + (n >> 3); 


(which makes q too large by a factor of 2, but it has one more 
bit of accuracy), and then inserting just before the assignment to 
r, 


   q = q >> 1; 


With this variation, the remainder is at most 9. However, there 
does not seem to be any better code for calculating 
$r \overset{u}{\div} 3$ with r limited to 9 than there is for r 
limited to 15 (four elementary instructions in either case). Thus, 
using the idea would cost one instruction. This possibility is 
mentioned because it does give a code improvement for most 
divisors. 


unsigned divu3(unsigned n) { 
   unsigned q, r; 

   q = (n >> 2) + (n >> 4);         // q = n*0.0101 (approx). 
   q = q + (q >> 4);                // q = n*0.01010101. 
   q = q + (q >> 8); 
   q = q + (q >> 16); 
   r = n - q*3;                     // 0 <= r <= 15. 
   return q + (11*r >> 5);          // Returning q + r/3. 
// return q + (5*(r + 1) >> 4);     // Alternative 1. 
// return q + ((r + 5 + (r << 2)) >> 4);// Alternative 2. 
} 


FIGURE 10-8. Unsigned divide by 3. 


Figure 10-9 shows two variations of this method for dividing 
by 5. The reciprocal of 5, in binary, is 


0.0011 0011 0011 0011 0011 0011 0011 0011. 


As in the case of division by 3, the simple repeating pattern of 
1’s and 0’s allows a fairly efficient and accurate computing of 
the quotient estimate. The estimate of the quotient computed by 
the code on the left can be off by at most 5, and it turns out that 
the remainder is at most 25. The code on the right retains two 
additional bits of accuracy in computing the quotient estimate, 
which is off by at most 2. The remainder in this case is at most 
10. The smaller maximum remainder permits approximating 1/5 
by 7/32 rather than 13/64, which gives a slightly more efficient 
program if the multiplications are done by shift's and add's. The 
instruction counts are, for the code on the left: 14 instructions 
including two multiplications, or 18 elementary instructions; for 
the code on the right 15 instructions including two 
multiplications, or 17 elementary instructions. The alternative 
code in the return statement is useful only if your machine has 
comparison predicate instructions. It doesn't reduce the 
instruction count, but merely has a little instruction-level 
parallelism. 


For division by 6, the divide-by-3 code can be used, followed 
by a shift right of 1. However, the extra instruction can be saved 
by doing the computation directly, using the binary 
approximation 


4/6 = 0.1010 1010 1010 1010 1010 1010 1010 1010. 


unsigned divu5a(unsigned n) {       unsigned divu5b(unsigned n) { 
   unsigned q, r;                      unsigned q, r; 

   q = (n >> 3) + (n >> 4);            q = (n >> 1) + (n >> 2); 
   q = q + (q >> 4);                   q = q + (q >> 4); 
   q = q + (q >> 8);                   q = q + (q >> 8); 
   q = q + (q >> 16);                  q = q + (q >> 16); 
   r = n - q*5;                        q = q >> 2; 
   return q + (13*r >> 6);             r = n - q*5; 
}                                      return q + (7*r >> 5); 
                                    // return q + (r>4) + (r>9); 
                                    } 


FIGURE 10-9. Unsigned divide by 5. 


The code is shown in Figure 10-10. The version on the left 
multiplies by an approximation to 1/6 and then corrects with a 
multiplication by 11/64. The version on the right takes 
advantage of the fact that by multiplying by an approximation to 
4/6, the quotient estimate is off by only 1 at most. This permits 
simpler code for the correction; it simply adds 1 to q if r ≥ 6. 
The code in the second return statement is appropriate if the 
machine has the comparison predicate instructions. Function 
divu6b is 15 instructions, including one multiply, as shown, or 17 
elementary instructions if the multiplication by 6 is changed to 
shift's and add's. 


unsigned divu6a(unsigned n) {       unsigned divu6b(unsigned n) { 
   unsigned q, r;                      unsigned q, r; 

   q = (n >> 3) + (n >> 5);            q = (n >> 1) + (n >> 3); 
   q = q + (q >> 4);                   q = q + (q >> 4); 
   q = q + (q >> 8);                   q = q + (q >> 8); 
   q = q + (q >> 16);                  q = q + (q >> 16); 
   r = n - q*6;                        q = q >> 2; 
   return q + (11*r >> 6);             r = n - q*6; 
}                                      return q + ((r + 2) >> 3); 
                                    // return q + (r > 5); 
                                    } 


FIGURE 10-10. Unsigned divide by 6. 


For larger divisors, usually it seems to be best to use an 
approximation to 1/d that is shifted left so that its most 
significant bit is 1. It seems that the quotient is then off by at 
most 1 usually (possibly always, this writer does not know), 
which permits efficient code for the correction step. Figure 10-11 
shows code for dividing by 7 and 9, using the binary 
approximations 


4/7 ≈ 0.1001 0010 0100 1001 0010 0100 1001 0010, and 


8/9 ≈ 0.1110 0011 1000 1110 0011 1000 1110 0011. 


If the multiplications by 7 and 9 are expanded into shift's and 
add's, these functions take 16 and 15 elementary instructions, 
respectively. 


unsigned divu7(unsigned n) {        unsigned divu9(unsigned n) { 
   unsigned q, r;                      unsigned q, r; 

   q = (n >> 1) + (n >> 4);            q = n - (n >> 3); 
   q = q + (q >> 6);                   q = q + (q >> 6); 
   q = q + (q>>12) + (q>>24);          q = q + (q>>12) + (q>>24); 
   q = q >> 2;                         q = q >> 3; 
   r = n - q*7;                        r = n - q*9; 
   return q + ((r + 1) >> 3);          return q + ((r + 7) >> 4); 
// return q + (r > 6);              // return q + (r > 8); 
}                                   } 


FIGURE 10-11. Unsigned divide by 7 and 9. 


Figures 10-12 and 10-13 show code for dividing by 10, 11, 
12, and 13. These are based on the binary approximations: 


8/10 ≈ 0.1100 1100 1100 1100 1100 1100 1100 1100, 
8/11 ≈ 0.1011 1010 0010 1110 1000 1011 1010 0010, 
8/12 ≈ 0.1010 1010 1010 1010 1010 1010 1010 1010, and 
8/13 ≈ 0.1001 1101 1000 1001 1101 1000 1001 1101. 


If the multiplications are expanded into shift’s and add's, these 
functions take 17, 20, 17, and 20 elementary instructions, 
respectively. 


unsigned divu10(unsigned n) {       unsigned divu11(unsigned n) { 
   unsigned q, r;                      unsigned q, r; 

   q = (n >> 1) + (n >> 2);            q = (n >> 1) + (n >> 2) - 
   q = q + (q >> 4);                       (n >> 5) + (n >> 7); 
   q = q + (q >> 8);                   q = q + (q >> 10); 
   q = q + (q >> 16);                  q = q + (q >> 20); 
   q = q >> 3;                         q = q >> 3; 
   r = n - q*10;                       r = n - q*11; 
   return q + ((r + 6) >> 4);          return q + ((r + 5) >> 4); 
// return q + (r > 9);              // return q + (r > 10); 
}                                   } 


FIGURE 10-12. Unsigned divide by 10 and 11. 


unsigned divu12(unsigned n) {       unsigned divu13(unsigned n) { 
   unsigned q, r;                      unsigned q, r; 

   q = (n >> 1) + (n >> 3);            q = (n>>1) + (n>>4); 
   q = q + (q >> 4);                   q = q + (q>>4) + (q>>5); 
   q = q + (q >> 8);                   q = q + (q>>12) + (q>>24); 
   q = q + (q >> 16);                  q = q >> 3; 
   q = q >> 3;                         r = n - q*13; 
   r = n - q*12;                       return q + ((r + 3) >> 4); 
   return q + ((r + 4) >> 4);       // return q + (r > 12); 
// return q + (r > 11);             } 
} 


FIGURE 10-13. Unsigned divide by 12 and 13. 


The case of dividing by 13 is instructive because it shows 
how you must look for repeating strings in the binary expansion 
of the reciprocal of the divisor. The first assignment sets q equal 
to n*0.1001. The second assignment to q adds n*0.00001001 
and n*0.000001001. At this point, q is (approximately) equal to 
n*0.100111011. The third assignment to q adds in repetitions of 
this pattern. It sometimes helps to use subtraction, as in the case 
of divu9 above. However, you must use care with subtraction, 
because it may cause the quotient estimate to be too large, in 
which case the remainder is negative and the method breaks 
down. It is quite complicated to get optimal code, and we don't 
have a general cookbook method that you can put in a compiler 
to handle any divisor. 


The examples above are able to economize on instructions, 
because the reciprocals have simple repeating patterns, and 
because the multiplication in the computation of the remainder 
r is by a small constant, which can be done with only a few 
shifts and add's. One might wonder how successful this method 
is for larger divisors. To roughly assess this, Figures 10-14 and 
10-15 show code for dividing by 100 and by 1000 (decimal). 
The relevant reciprocals are 


64/100 ≈ 0.1010 0011 1101 0111 0000 1010 0011 1101 and 


512/1000 ≈ 0.1000 0011 0001 0010 0110 1110 1001 0111. 


If the multiplications are expanded into shift’s and add's, these 
functions take 25 and 23 elementary instructions, respectively. 


unsigned divu100(unsigned n) { 
   unsigned q, r; 

   q = (n >> 1) + (n >> 3) + (n >> 6) - (n >> 10) + 
       (n >> 12) + (n >> 13) - (n >> 16); 
   q = q + (q >> 20); 
   q = q >> 6; 
   r = n - q*100; 
   return q + ((r + 28) >> 7); 
// return q + (r > 99); 
} 


FIGURE 10-14. Unsigned divide by 100. 


unsigned divu1000(unsigned n) { 
   unsigned q, r, t; 

   t = (n >> 7) + (n >> 8) + (n >> 12); 
   q = (n >> 1) + t + (n >> 15) + (t >> 11) + (t >> 14); 
   q = q >> 9; 
   r = n - q*1000; 
   return q + ((r + 24) >> 10); 
// return q + (r > 999); 
} 


FIGURE 10-15. Unsigned divide by 1000. 


In the case of dividing by 1000, the least significant eight bits 
of the reciprocal estimate are nearly ignored. The code of Figure 
10-15 replaces the binary 1001 0111 with 0100 0000, and still 
the quotient estimate is within one of the true quotient. Thus, it 
appears that although large divisors might have very little 
repetition in the binary representation of the reciprocal estimate, 
at least some bits can be ignored, which helps hold down the 
number of shift’s and add’s required to compute the quotient 
estimate. 


This section has shown, in a somewhat imprecise way, how 
unsigned division by a constant can be reduced to a sequence of, 
typically, about 20 elementary instructions. It is nontrivial to get 
an algorithm that generates these code sequences that is suitable 
for incorporation into a compiler, because of three difficulties in 
getting optimal code. 


1. It is necessary to search the reciprocal estimate bit string 
for repeating patterns. 


2. Negative terms (as in divu10 and divu100) can be used 
sometimes, but the error analysis required to determine 
just when they can be used is difficult. 


3. Sometimes some of the least significant bits of the 
reciprocal estimate can be ignored (how many?). 


Another difficulty for some target machines is that there are 
many variations on the code examples given that have more 
instructions, but that would execute faster on a machine with 
multiple shift and add units. 


The code of Figures 10-7 through 10-15 has been tested for 
all $2^{32}$ values of the dividends. 


Signed Division 


The methods given above can be made to apply to signed 
division. The right shift instructions in computing the quotient 
estimate become signed right shift instructions, which compute 
floor division by powers of 2. Thus, the quotient estimate is too 
low (algebraically), so the remainder is nonnegative, as in the 
unsigned case. 


The code most naturally computes the floor division result, so 
we need a correction to make it compute the conventional 
truncated-toward-0 result. This can be done with three 
computational instructions by adding d - 1 to the dividend if the 
dividend is negative. For example, if the divisor is 6, the code 
begins with (the shift here is a signed shift) 


   n = n + (n >> 31 & 5); 


Other than this, the code is very similar to that of the 
unsigned case. The number of elementary operations required is 
usually three more than in the corresponding unsigned division 
function. Several examples are given in Figures 10-16 through 
10-22. All have been exhaustively tested. 
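

The effect of the initial adjustment can be seen in isolation with the sketch below. It assumes, as the text does, that a right shift of a negative signed integer is an arithmetic (sign-propagating) shift; the C standard leaves this implementation-defined. Both helper names are illustrative. 

```c
#include <stdint.h>

/* Floor division by a positive divisor d, written out for reference. */
static int32_t floor_div(int32_t n, int32_t d) {
    int32_t q = n / d;                  /* C truncates toward 0          */
    if (n % d != 0 && n < 0)
        q--;                            /* step down for inexact n < 0   */
    return q;
}

/* Add d - 1 to n when n is negative (assumes arithmetic right shift). */
static int32_t bias_negative(int32_t n, int32_t d) {
    return n + ((n >> 31) & (d - 1));
}
```

For d = 6 and n = -7, plain floor division gives -2, but floor division of the biased dividend -2 gives -1, which is the truncated quotient. 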


int divs3(int n) { 
   int q, r; 

   n = n + (n>>31 & 2);             // Add 2 if n < 0. 
   q = (n >> 2) + (n >> 4);         // q = n*0.0101 (approx). 
   q = q + (q >> 4);                // q = n*0.01010101. 
   q = q + (q >> 8); 
   q = q + (q >> 16); 
   r = n - q*3;                     // 0 <= r <= 14. 
   return q + (11*r >> 5);          // Returning q + r/3. 
// return q + (5*(r + 1) >> 4);     // Alternative 1. 
// return q + ((r + 5 + (r << 2)) >> 4);// Alternative 2. 
} 


FIGURE 10-16. Signed divide by 3. 


int divs5(int n) {                  int divs6(int n) { 
   int q, r;                           int q, r; 

   n = n + (n>>31 & 4);                n = n + (n>>31 & 5); 
   q = (n >> 1) + (n >> 2);            q = (n >> 1) + (n >> 3); 
   q = q + (q >> 4);                   q = q + (q >> 4); 
   q = q + (q >> 8);                   q = q + (q >> 8); 
   q = q + (q >> 16);                  q = q + (q >> 16); 
   q = q >> 2;                         q = q >> 2; 
   r = n - q*5;                        r = n - q*6; 
   return q + (7*r >> 5);              return q + ((r + 2) >> 3); 
// return q + (r>4) + (r>9);        // return q + (r > 5); 
}                                   } 


FIGURE 10-17. Signed divide by 5 and 6. 


int divs7(int n) {                  int divs9(int n) { 
   int q, r;                           int q, r; 

   n = n + (n>>31 & 6);                n = n + (n>>31 & 8); 
   q = (n >> 1) + (n >> 4);            q = (n >> 1) + (n >> 2) + 
   q = q + (q >> 6);                       (n >> 3); 
   q = q + (q>>12) + (q>>24);          q = q + (q >> 6); 
   q = q >> 2;                         q = q + (q>>12) + (q>>24); 
   r = n - q*7;                        q = q >> 3; 
   return q + ((r + 1) >> 3);          r = n - q*9; 
// return q + (r > 6);                 return q + ((r + 7) >> 4); 
}                                   // return q + (r > 8); 
                                    } 


FIGURE 10-18. Signed divide by 7 and 9. 


int divs10(int n) {                 int divs11(int n) { 
   int q, r;                           int q, r; 

   n = n + (n>>31 & 9);                n = n + (n>>31 & 10); 
   q = (n >> 1) + (n >> 2);            q = (n >> 1) + (n >> 2) - 
   q = q + (q >> 4);                       (n >> 5) + (n >> 7); 
   q = q + (q >> 8);                   q = q + (q >> 10); 
   q = q + (q >> 16);                  q = q + (q >> 20); 
   q = q >> 3;                         q = q >> 3; 
   r = n - q*10;                       r = n - q*11; 
   return q + ((r + 6) >> 4);          return q + ((r + 5) >> 4); 
// return q + (r > 9);              // return q + (r > 10); 
}                                   } 


FIGURE 10-19. Signed divide by 10 and 11. 


int divs12(int n) {                 int divs13(int n) { 
   int q, r;                           int q, r; 

   n = n + (n>>31 & 11);               n = n + (n>>31 & 12); 
   q = (n >> 1) + (n >> 3);            q = (n>>1) + (n>>4); 
   q = q + (q >> 4);                   q = q + (q>>4) + (q>>5); 
   q = q + (q >> 8);                   q = q + (q>>12) + (q>>24); 
   q = q + (q >> 16);                  q = q >> 3; 
   q = q >> 3;                         r = n - q*13; 
   r = n - q*12;                       return q + ((r + 3) >> 4); 
   return q + ((r + 4) >> 4);       // return q + (r > 12); 
// return q + (r > 11);             } 
} 


FIGURE 10-20. Signed divide by 12 and 13. 


int divs100(int n) { 
   int q, r; 

   n = n + (n>>31 & 99); 
   q = (n >> 1) + (n >> 3) + (n >> 6) - (n >> 10) + 
       (n >> 12) + (n >> 13) - (n >> 16); 
   q = q + (q >> 20); 
   q = q >> 6; 
   r = n - q*100; 
   return q + ((r + 28) >> 7); 
// return q + (r > 99); 
} 


FIGURE 10-21. Signed divide by 100. 


int divs1000(int n) { 
   int q, r, t; 

   n = n + (n >> 31 & 999); 
   t = (n >> 7) + (n >> 8) + (n >> 12); 
   q = (n >> 1) + t + (n >> 15) + (t >> 11) + (t >> 14) + 
       (n >> 26) + (t >> 21); 
   q = q >> 9; 
   r = n - q*1000; 
   return q + ((r + 24) >> 10); 
// return q + (r > 999); 
} 


FIGURE 10-22. Signed divide by 1000. 


10-19 Remainder by Summing Digits 


This section addresses the problem of computing the remainder 
of division by a constant without computing the quotient. The 
methods of this section apply only to divisors of the form 
$2^k \pm 1$, for k an integer greater than or equal to 2, and in 
most cases the code resorts to a table lookup (an indexed load 
instruction) after a fairly short calculation. 


We will make frequent use of the following elementary 
property of congruences: 


THEOREM C. If $a \equiv b \pmod{m}$ and $c \equiv d \pmod{m}$, then 


$$a + c \equiv b + d \pmod{m} \quad \text{and} \quad 
ac \equiv bd \pmod{m}.$$ 


The unsigned case is simpler and is dealt with first. 


Unsigned Remainder 


For a divisor of 3, multiplying the trivial congruence 
$1 \equiv 1 \pmod 3$ repeatedly by the congruence 
$2 \equiv -1 \pmod 3$, we conclude by Theorem C that 


$$2^k \equiv \begin{cases} 1 \pmod 3, & k \text{ even}, \\ 
-1 \pmod 3, & k \text{ odd}. \end{cases}$$ 


Therefore, a number n written in binary as 
$\ldots b_3 b_2 b_1 b_0$ satisfies 


$$n = \cdots + b_3 \cdot 2^3 + b_2 \cdot 2^2 + b_1 \cdot 2 + b_0 
\equiv \cdots - b_3 + b_2 - b_1 + b_0 \pmod 3,$$ 


which is derived by using Theorem C repeatedly. Thus, we can 
alternately add and subtract the bits in the binary representation 
of the number to obtain a smaller number that has the same 
remainder upon division by 3. If the sum is negative, you must 
add a multiple of 3 to make it nonnegative. The process can then 
be repeated until the result is in the range 0 to 2. 


The same trick works for finding the remainder after dividing 
a decimal number by 11. 


Thus, if the machine has the population count instruction, a 
function that computes the remainder modulo 3 of an unsigned 
number n might begin with 


   n = pop(n & 0x55555555) - pop(n & 0xAAAAAAAA); 


This can be simplified by using the following surprising identity 
discovered by Paolo Bonzini [Bonz]: 


$$\mathrm{pop}(x \,\&\, \bar{m}) - \mathrm{pop}(x \,\&\, m) = 
\mathrm{pop}(x \oplus m) - \mathrm{pop}(m). \tag{2}$$ 


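Proof: 


$$\begin{aligned} 
\mathrm{pop}(x \,\&\, \bar{m}) - \mathrm{pop}(x \,\&\, m) 
 &= \mathrm{pop}(x \,\&\, \bar{m}) - (32 - \mathrm{pop}(\overline{x \,\&\, m})) 
 && \mathrm{pop}(\bar{a}) = 32 - \mathrm{pop}(a). \\ 
 &= \mathrm{pop}(x \,\&\, \bar{m}) + \mathrm{pop}(\bar{x} \mid \bar{m}) - 32 
 && \text{DeMorgan.} \\ 
 &= \mathrm{pop}(x \,\&\, \bar{m}) + \mathrm{pop}(\bar{x} \,\&\, m) + \mathrm{pop}(\bar{m}) - 32 
 && \mathrm{pop}(a \mid b) = \mathrm{pop}(a \,\&\, \bar{b}) + \mathrm{pop}(b). \\ 
 &= \mathrm{pop}((x \,\&\, \bar{m}) \mid (\bar{x} \,\&\, m)) + \mathrm{pop}(\bar{m}) - 32 
 && \text{Disjoint.} \\ 
 &= \mathrm{pop}(x \oplus m) - \mathrm{pop}(m). 
\end{aligned}$$ 


Since the references to 32 (the word size) cancel out, the result 
holds for any word size. Another way to prove (2) is to observe 
that it holds for x = 0, and if a 0-bit in x is changed to a 1 
where m is 1, then both sides of (2) decrease by 1, and if a 0-bit 
of x is changed to a 1 where m is 0, then both sides of (2) 
increase by 1. 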
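

Identity (2) is easy to spot-check numerically. The sketch below assumes no hardware population count and substitutes a simple portable loop for pop; the names `lhs` and `rhs` are illustrative. 

```c
#include <stdint.h>

/* Portable population count (number of 1-bits); a stand-in for pop(). */
static int pop(uint32_t x) {
    int count = 0;
    while (x != 0) {
        x &= x - 1;                     /* clear the lowest 1-bit */
        count++;
    }
    return count;
}

/* Left and right sides of identity (2). */
static int lhs(uint32_t x, uint32_t m) { return pop(x & ~m) - pop(x & m); }
static int rhs(uint32_t x, uint32_t m) { return pop(x ^ m) - pop(m); }
```

With m = 0xAAAAAAAA the left side is the alternating bit sum used for the remainder modulo 3, and the right side is pop(x ^ 0xAAAAAAAA) - 16. 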


Applying (2) to the line of C code above gives 


   n = pop(n ^ 0xAAAAAAAA) - 16; 


We want to apply this transformation again, until n is in the 
range 0 to 2, if possible. It is best to avoid producing a negative 
value of n, because the sign bit would not be treated properly on 
the next round. A negative value can be avoided by adding a 
sufficiently large multiple of 3 to n. Bonzini's code, shown in 
Figure 10-23, increases the constant by 39. This is larger than 
necessary to make n nonnegative, but it causes n to range from 
-3 to 2 (rather than -3 to 3) after the second round of reduction. 
This simplifies the code on the return statement, which is 
adding 3 if n is negative. The function executes in 11 
instructions, counting two to load the large constant. 


Figure 10-24 shows a variation that executes in four 
instructions, plus a simple table lookup operation (e.g. an 
indexed load byte instruction). 


int remu3(unsigned n) { 
   n = pop(n ^ 0xAAAAAAAA) + 23;    // Now 23 <= n <= 55. 
   n = pop(n ^ 0x2A) - 3;           // Now -3 <= n <= 2. 
   return n + (((int)n >> 31) & 3); 
} 


FIGURE 10-23. Unsigned remainder modulo 3, using 
population count. 


int remu3(unsigned n) { 

   static char table[33] = {2, 0,1,2, 0,1,2, 0,1,2, 
      0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 
      0,1,2, 0,1}; 

   n = pop(n ^ 0xAAAAAAAA); 
   return table[n]; 
} 


FIGURE 10-24. Unsigned remainder modulo 3, using 
population count and a table lookup. 
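

For readers without a population count instruction at hand, the two-round reduction of Figure 10-23 can be tried with a software pop. This is a sketch under that assumption: the loop-based `pop` and the name `remu3_pop` are illustrative, and the final step is written as a conditional rather than the shift-and-mask of the figure, to avoid relying on the behavior of right shifts of negative values. 

```c
#include <stdint.h>

/* Portable population count; a stand-in for the pop instruction. */
static int pop(uint32_t x) {
    int count = 0;
    while (x != 0) {
        x &= x - 1;                       /* clear the lowest 1-bit */
        count++;
    }
    return count;
}

/* Remainder of n modulo 3 by two rounds of population counting. */
static int remu3_pop(uint32_t n) {
    int r = pop(n ^ 0xAAAAAAAAu) + 23;    /* 23 <= r <= 55  */
    r = pop((uint32_t)r ^ 0x2Au) - 3;     /* -3 <= r <= 2   */
    return r < 0 ? r + 3 : r;             /* add 3 if negative */
}
```

For n = 0, the first round gives 16 + 23 = 39, and the second gives pop(39 ^ 0x2A) - 3 = 3 - 3 = 0. 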


To avoid the population count instruction, notice that because 
$4 \equiv 1 \pmod 3$, $4^k \equiv 1 \pmod 3$. A binary number can 
be viewed as a base 4 number by taking its bits in pairs and 
interpreting the bits 00 to 11 as a base 4 digit ranging from 0 to 
3. The pairs of bits can be summed using the code of Figure 5-2 on 
page 82, omitting the first executable line (overflow does not 
occur in the additions). The final sum ranges from 0 to 48, and a 
table lookup can be used to reduce this to the range 0 to 2. The 
resulting function is 16 elementary instructions, plus an indexed 
load. 


There is a similar, but slightly better, way. As a first step, n 
can be reduced to a smaller number that is in the same 
congruence class modulo 3 with 


   n = (n >> 16) + (n & 0xFFFF); 


This splits the number into two 16-bit portions, which are added 
together. The contribution modulo 3 of the left 16 bits of n is 
not altered by shifting them right 16 positions, because the 
shifted number, multiplied by $2^{16}$, is the original number, and 
$2^{16} \equiv 1 \pmod 3$. More generally, $2^k \equiv 1 \pmod 3$ 
if k is even. This is used repeatedly (five times) in the code 
shown in Figure 10-25. This code is 19 instructions. The 
instruction count can be reduced by cutting off the digit summing 
earlier and using an in-memory table lookup, as illustrated in 
Figure 10-26 (nine instructions, plus an indexed load). The 
instruction count can be reduced to six (plus an indexed load) by 
using a table of size 0x2FE = 766 bytes. 
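

A sketch of the six-instruction variant follows. The text gives only the table size, so the details here are assumptions: the table is filled at run time by a hypothetical `init_rem3_table` rather than written out as 766 initializers, and the names are illustrative. 

```c
#include <stdint.h>

enum { TABLE_SIZE = 0x2FE };              /* indices 0 .. 0x2FD */
static unsigned char rem3_table[TABLE_SIZE];

/* Fill the table once; entry i holds i mod 3. */
static void init_rem3_table(void) {
    for (unsigned i = 0; i < TABLE_SIZE; i++)
        rem3_table[i] = (unsigned char)(i % 3);
}

/* Two digit-summing steps, then one indexed load. */
static int remu3_table(uint32_t n) {
    n = (n >> 16) + (n & 0xFFFF);         /* max 0x1FFFE */
    n = (n >>  8) + (n & 0x00FF);         /* max 0x2FD   */
    return rem3_table[n];
}
```

The largest index, 0x2FD, is reached for inputs such as 0xFFFFFFFF, which is why 0x2FE entries suffice. `init_rem3_table` must be called once before `remu3_table` is used. 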


To compute the unsigned remainder modulo 5, the code of 
Figure 10-27 uses the relations $16^k \equiv 1 \pmod 5$ and 
$4 \equiv -1 \pmod 5$. It is 21 elementary instructions, assuming 
the multiplication by 3 is expanded into a shift and add. 


int remu3(unsigned n) { 
   n = (n >> 16) + (n & 0xFFFF);    // Max 0x1FFFE. 
   n = (n >>  8) + (n & 0x00FF);    // Max 0x2FD. 
   n = (n >>  4) + (n & 0x000F);    // Max 0x3D. 
   n = (n >>  2) + (n & 0x0003);    // Max 0x11. 
   n = (n >>  2) + (n & 0x0003);    // Max 0x6. 
   return (0x0924 >> (n << 1)) & 3; 
} 


FIGURE 10-25. Unsigned remainder modulo 3, digit summing 
and an in-register lookup. 


int remu3(unsigned n) {
   static char table[62] = {0,1,2, 0,1,2, 0,1,2, 0,1,2,
      0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
      0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
      0,1,2, 0,1,2, 0,1};

   n = (n >> 16) + (n & 0xFFFF);     // Max 0x1FFFE.
   n = (n >>  8) + (n & 0x00FF);     // Max 0x2FD.
   n = (n >>  4) + (n & 0x000F);     // Max 0x3D.
   return table[n];
}


FIGURE 10-26. Unsigned remainder modulo 3, digit summing
and an in-memory lookup.


int remu5(unsigned n) {
   n = (n >> 16) + (n & 0xFFFF);           // Max 0x1FFFE.
   n = (n >>  8) + (n & 0x00FF);           // Max 0x2FD.
   n = (n >>  4) + (n & 0x000F);           // Max 0x3D.
   n = (n>>4) - ((n>>2) & 3) + (n & 3);    // -3 to 6.
   return (01043210432 >> 3*(n + 3)) & 7;  // Octal const.
}


FIGURE 10-27. Unsigned remainder modulo 5, digit summing
method.


The instruction count can be reduced by using a table, similar
to what is done in Figure 10-26. In fact, the code is identical,
except the table is:


   static char table[62] = {0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
      0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
      0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
      0,1,2,3,4, 0,1};


For the unsigned remainder modulo 7, the code of Figure 10-28
uses the relation 8^k ≡ 1 (mod 7) (nine elementary
instructions, plus an indexed load).


As a final example, the code of Figure 10-29 computes the
remainder of unsigned division by 9. It is based on the relation
8 ≡ -1 (mod 9). As shown, it is nine elementary instructions, plus
an indexed load. The elementary instruction count can be
reduced to six by using a table of size 831 (decimal).


int remu7(unsigned n) {
   static char table[75] = {0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
      0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
      0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
      0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4};

   n = (n >> 15) + (n & 0x7FFF);     // Max 0x27FFE.
   n = (n >>  9) + (n & 0x001FF);    // Max 0x33D.
   n = (n >>  6) + (n & 0x0003F);    // Max 0x4A.
   return table[n];
}


FIGURE 10-28. Unsigned remainder modulo 7, digit summing
method.


int remu9(unsigned n) {
   int r;
   static char table[75] = {0,1,2,3,4,5,6,7,8,
      0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
      0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
      0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
      0,1,2,3,4,5,6,7,8, 0,1,2};

   r = (n & 0x7FFF) - (n >> 15);     // FFFE0001 to 7FFF.
   r = (r & 0x01FF) - (r >> 9);      // FFFFFFC1 to 2FF.
   r = (r & 0x003F) + (r >> 6);      // 0 to 4A.
   return table[r];
}


FIGURE 10-29. Unsigned remainder modulo 9, digit summing
method.


Signed Remainder 


The digit summing method can be adapted to compute the
remainder resulting from signed division. There seems to be no
better way than to add a few steps to correct the result of the
method as applied to unsigned division. Two corrections are
necessary: (1) correct for a different interpretation of the sign
bit, and (2) add or subtract a multiple of the divisor d to get the
result in the range 0 to -(d - 1).


For division by 3, the unsigned remainder code interprets the
sign bit of the dividend n as contributing 2 to the remainder
(because 2^31 mod 3 = 2). For the remainder of signed division,
the sign bit contributes only 1 (because (-2^31) mod 3 = 1).
Therefore, we can use the code for an unsigned remainder and
correct its result by subtracting 1. Then, the result must be put
in the range 0 to -2. That is, the result of the unsigned
remainder code must be mapped as follows:


   (0, 1, 2) ⇒ (-1, 0, 1) ⇒ (-1, 0, -2).


This adjustment can be done fairly efficiently by subtracting 1
from the unsigned remainder if it is 0 or 1, and subtracting 4 if it
is 2 (when the dividend is negative). The code must not alter the
dividend n, because it is needed in this last step.


This procedure can easily be applied to any of the functions 
given for the unsigned remainder modulo 3. For example, 
applying it to Figure 10-26 on page 265 gives the function 
shown in Figure 10-30. It is 13 elementary instructions, plus an 
indexed load. The instruction count can be reduced by using a 
larger table. 


int rems3(int n) {
   unsigned r;
   static char table[62] = {0,1,2, 0,1,2, 0,1,2, 0,1,2,
      0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
      0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
      0,1,2, 0,1,2, 0,1};

   r = n;
   r = (r >> 16) + (r & 0xFFFF);     // Max 0x1FFFE.
   r = (r >>  8) + (r & 0x00FF);     // Max 0x2FD.
   r = (r >>  4) + (r & 0x000F);     // Max 0x3D.
   r = table[r];
   return r - (((unsigned)n >> 31) << (r & 2));
}


FIGURE 10-30. Signed remainder modulo 3, digit summing
method.


Figures 10-31 to 10-33 show similar code for computing the 
signed remainder of division by 5, 7, and 9. All the functions 
consist of 15 elementary operations, plus an indexed load. They 
use signed right shifts, and the final adjustment consists of 
subtracting the modulus if the dividend is negative and the 
remainder is nonzero. The number of instructions can be 
reduced by using larger tables. 


int rems5(int n) {
   int r;
   static char table[62] = {2,3,4, 0,1,2,3,4, 0,1,2,3,4,
      0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
      0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
      0,1,2,3,4, 0,1,2,3};

   r = (n >> 16) + (n & 0xFFFF);     // FFFF8000 to 17FFE.
   r = (r >>  8) + (r & 0x00FF);     // FFFFFF80 to 27D.
   r = (r >>  4) + (r & 0x000F);     // -8 to 53 (decimal).
   r = table[r + 8];
   return r - (((int)(n & -r) >> 31) & 5);
}


FIGURE 10-31. Signed remainder modulo 5, digit summing
method.


int rems7(int n) {
   int r;
   static char table[75] = {5,6, 0,1,2,3,4,5,6,
      0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
      0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
      0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2};

   r = (n >> 15) + (n & 0x7FFF);     // FFFF0000 to 17FFE.
   r = (r >>  9) + (r & 0x001FF);    // FFFFFF80 to 2BD.
   r = (r >>  6) + (r & 0x0003F);    // -2 to 72 (decimal).
   r = table[r + 2];
   return r - (((int)(n & -r) >> 31) & 7);
}


FIGURE 10-32. Signed remainder modulo 7, digit summing
method.


int rems9(int n) {
   int r;
   static char table[75] = {7,8, 0,1,2,3,4,5,6,7,8,
      0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
      0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
      0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
      0,1,2,3,4,5,6,7,8, 0};

   r = (n & 0x7FFF) - (n >> 15);     // FFFF7001 to 17FFF.
   r = (r & 0x01FF) - (r >> 9);      // FFFFFF41 to 0x27F.
   r = (r & 0x003F) + (r >> 6);      // -2 to 72 (decimal).
   r = table[r + 2];
   return r - (((int)(n & -r) >> 31) & 9);
}


FIGURE 10-33. Signed remainder modulo 9, digit summing
method.


10-20 Remainder by Multiplication and Shifting Right


The method described in this section applies, in principle, to all
integer divisors greater than 2, but as a practical matter only to
fairly small divisors and to divisors of the form 2^k - 1. As in the
preceding section, in most cases the code resorts to a table
lookup after a fairly short calculation.


Unsigned Remainder 


This section uses the mathematical (not computer algebra)
notation a mod b, where a and b are integers and b > 0, to
denote the integer x, 0 ≤ x < b, that satisfies x ≡ a (mod b).


To compute n mod 3, observe that


   n mod 3 = ⌊(4/3)n⌋ mod 4.                                  (3)


Proof: Let n = 3k + δ, where δ and k are integers and 0 ≤ δ ≤ 2.
Then


   ⌊(4/3)(3k + δ)⌋ mod 4 = ⌊4k + 4δ/3⌋ mod 4 = ⌊4δ/3⌋ mod 4.


Clearly, the value of the last expression is 0, 1, or 2 for δ = 0, 1,
or 2 respectively. This allows changing the problem of
computing the remainder modulo 3 to one of computing the
remainder modulo 4, which is of course much easier on a binary
computer.
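Relation (3) can be checked numerically with a few lines of C. The sketch below is only a demonstration of the identity; it uses a 64-bit division to form the floor of 4n/3, which is exactly what the rest of this section works to avoid. The name mod3_via_mod4 is hypothetical.

```c
/* Sketch: n mod 3 obtained as floor(4n/3) mod 4, per relation (3). */
unsigned mod3_via_mod4(unsigned n) {
    unsigned long long q = (4ULL * n) / 3;   /* floor(4n/3), in 64 bits      */
    return (unsigned)(q & 3);                /* mod 4 is the low two bits    */
}
```

For example, n = 5 gives floor(20/3) = 6, and 6 mod 4 = 2 = 5 mod 3.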


Relations like (3) do not hold for all moduli, but similar
relations do hold if the modulus is of the form 2^k - 1, for k an
integer greater than 1. For example, it is easy to show that


   n mod 7 = ⌊(8/7)n⌋ mod 8.


For numbers not of the form 2^k - 1, there is no such simple
relation, but there is a certain uniqueness property that can be
used to compute the remainder for other divisors. For example,
if the divisor is 10 (decimal), consider the expression


   ⌊(16/10)n⌋ mod 16.                                         (4)


Let n = 10k + δ, where 0 ≤ δ ≤ 9. Then


   ⌊(16/10)n⌋ mod 16 = ⌊(16/10)(10k + δ)⌋ mod 16 = ⌊16δ/10⌋ mod 16.


For δ = 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, the last expression takes
on the values 0, 1, 3, 4, 6, 8, 9, 11, 12, and 14 respectively. The
latter numbers are all distinct. Therefore, if we can find a
reasonably easy way to compute (4), we can translate 0 to 0, 1
to 1, 3 to 2, 4 to 3, and so on, to obtain the remainder of
division by 10. This will generally require a translation table of
size equal to the next power of 2 greater than the divisor, so the
method is practical only for fairly small divisors (and for divisors
of the form 2^k - 1, for which table lookup is not required).
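The translation idea for a divisor of 10 can be illustrated directly. In this sketch expression (4) is evaluated exactly, with a 64-bit division, purely to show the mapping; the later figures replace that division with a multiply and shift. The name mod10_via_mod16 and the use of -1 to mark table entries that can never be indexed are assumptions made for the illustration.

```c
/* Sketch: n mod 10 from floor(16n/10) mod 16 plus a 16-entry translation. */
unsigned mod10_via_mod16(unsigned n) {
    /* Index: value of expression (4).  Entry: the remainder it stands for.
       -1 marks values that expression (4) never produces. */
    static const signed char translate[16] = {
        0, 1, -1, 2, 3, -1, 4, -1,
        5, 6, -1, 7, 8, -1, 9, -1};
    unsigned v = (unsigned)((16ULL * n / 10) & 15);   /* expression (4) */
    return (unsigned)translate[v];
}
```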


The code to be shown was derived by using a little of the 
above theory and a lot of trial and error. 


Consider the remainder of unsigned division by 3. Following
(3), we wish to compute the rightmost two bits of the integer
part of 4n/3. This can be done approximately by multiplying by
⌊2^32/3⌋ and then dividing by 2^30 using a shift right instruction.
When the multiplication by ⌊2^32/3⌋ is done (using the multiply
instruction that gives the low-order 32 bits of the product),
high-order bits will be lost. But that doesn't matter, and, in fact, it's
helpful, because we want the result modulo 4. Therefore,
because ⌊2^32/3⌋ = 0x55555555, a possible plan is to compute


   r ← (0x55555555*n) >> 30.


Experiment indicates that this works for n in the range 0 to
2^30 + 2. It almost works, I should say; if n is nonzero and a
multiple of 3, it gives the result 3. Therefore, it must be followed
by a translation step that translates (0, 1, 2, 3) to (0, 1, 2, 0)
respectively.


To extend the range of applicability, the multiplication must
be done more accurately. Two more bits of accuracy suffice (that
is, multiplying by 0x55555555.4). The following calculation,
followed by the translation step, works for all n representable as
an unsigned 32-bit integer:


   r ← (0x55555555*n + (n >> 2)) >> 30.


It is, of course, possible to give a formal proof of this, but the 
algebra is quite lengthy and error prone. 
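The two plans above, each followed by the translation step, might be written as below. This is a sketch for experimentation: the names translate3, remu3_plan, and remu3_refined are hypothetical, the translation is written as a plain conditional rather than in the three or four instructions a compiler might produce, and a 32-bit unsigned int is assumed.

```c
/* Translation step: (0, 1, 2, 3) becomes (0, 1, 2, 0). */
static unsigned translate3(unsigned r) {
    return (r == 3) ? 0 : r;
}

/* First plan: per the text, adequate for 0 <= n <= 2^30 + 2 only. */
unsigned remu3_plan(unsigned n) {
    return translate3((0x55555555u * n) >> 30);
}

/* Refined plan: the (n >> 2) term supplies the ".4" of 0x55555555.4. */
unsigned remu3_refined(unsigned n) {
    return translate3((0x55555555u * n + (n >> 2)) >> 30);
}
```

Comparing both functions against n % 3 over a range of n is an easy way to repeat the experiment the text describes.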


The translation step can be done in three or four instructions
on most machines, but there is a way to avoid it at a cost of two
instructions. The above expression for computing r estimates
low. If you estimate slightly high, the result is always 0, 1, or 2.
This gives the C function shown in Figure 10-34 (eight
instructions, including a multiply).


int remu3(unsigned n) {
   return (0x55555555*n + (n >> 1) - (n >> 3)) >> 30;
}


FIGURE 10-34. Unsigned remainder modulo 3, multiplication
method.


The multiplication can be expanded, giving the 13-instruction
function shown in Figure 10-35 that uses only shift's and add's.


int remu3(unsigned n) {
   unsigned r;

   r = n + (n << 2);
   r = r + (r << 4);
   r = r + (r << 8);
   r = r + (r << 16);
   r = r + (n >> 1);
   r = r - (n >> 3);
   return r >> 30;
}


FIGURE 10-35. Unsigned remainder modulo 3, multiplication
(expanded) method.


The remainder of unsigned division by 5 can be computed
very similarly to the remainder of division by 3. Let n = 5k + r
with 0 ≤ r ≤ 4. Then ⌊(8/5)n⌋ mod 8 = ⌊(8/5)(5k + r)⌋ mod 8
= ⌊(8/5)r⌋ mod 8. For r = 0, 1, 2, 3, and 4, this takes on the
values 0, 1, 3, 4, and 6 respectively. Since ⌊2^32/5⌋ =
0x33333333, this leads to the function shown in Figure 10-36
(11 instructions, including a multiply). The last step (code on the
return statement) is mapping (0, 1, 3, 4, 6, 7) to (0, 1, 2, 3, 4, 0)
respectively, using an in-register method rather than an indexed
load from memory. By also mapping 2 to 2 and 5 to 4, the
precision required in the multiplication by 2^32/5 is reduced to
using just the term n >> 3 to approximate the missing part of
the multiplier (hexadecimal 0.333...). If the “accuracy” term
n >> 3 is omitted, the code still works for n ranging from 0 to
0x60000004.


int remu5(unsigned n) {
   n = (0x33333333*n + (n >> 3)) >> 29;
   return (0x04432210 >> (n << 2)) & 7;
}


FIGURE 10-36. Unsigned remainder modulo 5, multiplication
method.


The code for computing the unsigned remainder modulo 7 is
similar, but the mapping step is simpler; it is necessary only to
convert 7 to 0. One way to code it is shown in Figure 10-37 (11
instructions, including a multiply). If the accuracy term n >> 4 is
omitted, the code still works for n up to 0x40000006. With both
accuracy terms omitted, it works for n up to 0x08000006.


int remu7(unsigned n) {
   n = (0x24924924*n + (n >> 1) + (n >> 4)) >> 29;
   return n & ((int)(n - 7) >> 31);
}


FIGURE 10-37. Unsigned remainder modulo 7, multiplication
method.


Code for computing the unsigned remainder modulo 9 is
shown in Figure 10-38. It is six instructions, including a multiply,
plus an indexed load. If the accuracy term n >> 1 is omitted and
the multiplier is changed to 0x1C71C71D, the function works for
n up to 0x1999999E.


int remu9(unsigned n) {
   static char table[16] = {0, 1, 1, 2, 2, 3, 3, 4,
                            5, 5, 6, 6, 7, 7, 8, 8};

   n = (0x1C71C71C*n + (n >> 1)) >> 28;
   return table[n];
}


FIGURE 10-38. Unsigned remainder modulo 9, multiplication
method.


Figure 10-39 shows a way to compute the unsigned
remainder modulo 10. It is eight instructions, including a
multiply, plus an indexed load instruction. If the accuracy term
n >> 3 is omitted, the code works for n up to 0x40000004. If both
accuracy terms are omitted, it works for n up to 0x0AAAAAAD.


int remu10(unsigned n) {
   static char table[16] = {0, 1, 2, 2, 3, 3, 4, 5,
                            5, 6, 7, 7, 8, 8, 9, 0};

   n = (0x19999999*n + (n >> 1) + (n >> 3)) >> 28;
   return table[n];
}


FIGURE 10-39. Unsigned remainder modulo 10,
multiplication method.


As a final example, consider the computation of the
remainder modulo 63. This function is used by the population
count program at the top of page 84. Joe Keane [Keane] has
come up with the rather mysterious code shown in Figure 10-40.
It is 12 elementary instructions on the basic RISC.


int remu63(unsigned n) {
   unsigned t;

   t = (((n >> 12) + n) >> 10) + (n << 2);
   t = ((t >> 6) + t + 3) & 0xFF;
   return (t - (t >> 6)) >> 2;
}


FIGURE 10-40. Unsigned remainder modulo 63, Keane’s
method.


The “multiply and shift right” method leads to the code
shown in Figure 10-41. This is 11 instructions on the basic RISC,
one being a multiply. This would not be as fast as Keane's
method, unless the machine has a very fast multiply and the
load of the constant 0x04104104 can move out of a loop.


int remu63(unsigned n) {
   n = (0x04104104*n + (n >> 4) + (n >> 10)) >> 26;
   return n & ((n - 63) >> 6);       // Change 63 to 0.
}


FIGURE 10-41. Unsigned remainder modulo 63,
multiplication method.


On some machines, an improvement can result from
expanding the multiplication into shifts and adds as follows (15
elementary instructions for the whole function):


   r = (n << 2) + (n << 8);          // r = 0x104*n.
   r = r + (r << 12);                // r = 0x104104*n.
   r = r + (n << 26);                // r = 0x04104104*n.
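Putting the three expansion lines into the function of Figure 10-41 gives something like the following. It is a sketch, assuming a 32-bit unsigned int; the name remu63_shift_add is hypothetical, and the accuracy terms and final masking are taken unchanged from Figure 10-41.

```c
/* Sketch: remainder modulo 63 with the multiply expanded into shifts/adds. */
unsigned remu63_shift_add(unsigned n) {
    unsigned r;

    r = (n << 2) + (n << 8);          /* r = 0x104*n.                 */
    r = r + (r << 12);                /* r = 0x104104*n.              */
    r = r + (n << 26);                /* r = 0x04104104*n.            */
    r = r + (n >> 4) + (n >> 10);     /* accuracy terms.              */
    r = r >> 26;                      /* 0 to 63.                     */
    return r & ((r - 63) >> 6);       /* change 63 to 0.              */
}
```

Counting one instruction per shift, add, subtract, and AND gives the 15 elementary instructions quoted above.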


Signed Remainder 


As in the case of the digit summing method, the “multiply and
shift right” method can be adapted to compute the remainder
resulting from signed division. Again, there seems to be no
better way than to add a few steps to correct the result of the
method as applied to unsigned division. For example, the code
shown in Figure 10-42 is derived from Figure 10-34 on page
270 (12 instructions, including a multiply).


int rems3(int n) {
   unsigned r;

   r = n;
   r = (0x55555555*r + (r >> 1) - (r >> 3)) >> 30;
   return r - (((unsigned)n >> 31) << (r & 2));
}


FIGURE 10-42. Signed remainder modulo 3, multiplication
method.


Some plausible ways to compute the remainder of signed
division by 5, 7, 9, and 10 are shown in Figures 10-43 to 10-46.
The code for a divisor of 7 uses quite a few extra instructions
(19 in all, including a multiply); it might be preferable to use a
table similar to that shown for the cases in which the divisor is
5, 9, or 10. In the latter cases, the table used for unsigned
division is doubled in size, with the sign bit of the dividend
factored in to index the table. Entries shown as u are unused.


int rems5(int n) {
   unsigned r;
   static signed char table[16] = {0, 1, 2, 2, 3, u, 4, 0,
                                   u, 0,-4, u,-3,-2,-2,-1};

   r = n;
   r = ((0x33333333*r) + (r >> 3)) >> 29;
   return table[r + (((unsigned)n >> 31) << 3)];
}


FIGURE 10-43. Signed remainder modulo 5, multiplication
method.


int rems7(int n) {
   unsigned r;

   r = n - (((unsigned)n >> 31) << 2);               // Fix for sign.
   r = ((0x24924924*r) + (r >> 1) + (r >> 4)) >> 29;
   r = r & ((int)(r - 7) >> 31);                     // Change 7 to 0.
   return r - (((int)(n & -r) >> 31) & 7);           // Fix n<0 case.
}


FIGURE 10-44. Signed remainder modulo 7, multiplication
method.


int rems9(int n) {
   unsigned r;
   static signed char table[32] = {0, 1, 1, 2, u, 3, u, 4,
                                   5, 5, 6, 6, 7, u, 8, u,
                                  -4, u,-3, u,-2,-1,-1, 0,
                                   u,-8, u,-7,-6,-6,-5,-5};

   r = n;
   r = (0x1C71C71C*r + (r >> 1)) >> 28;
   return table[r + (((unsigned)n >> 31) << 4)];
}


FIGURE 10-45. Signed remainder modulo 9, multiplication
method.


int rems10(int n) {
   unsigned r;
   static signed char table[32] = {0, 1, u, 2, 3, u, 4, 5,
                                   5, 6, u, 7, 8, u, 9, u,
                                  -6,-5, u,-4,-3,-3,-2, u,
                                  -1, 0, u,-9, u,-8,-7, u};

   r = n;
   r = (0x19999999*r + (r >> 1) + (r >> 3)) >> 28;
   return table[r + (((unsigned)n >> 31) << 4)];
}


FIGURE 10-46. Signed remainder modulo 10, multiplication
method.


10-21 Converting to Exact Division 


Since the remainder can be computed without computing the
quotient, the possibility arises of computing the quotient q =
⌊n/d⌋ by first computing the remainder, subtracting this from
the dividend n, and then dividing the difference by the divisor d.
This last division is an exact division, and it can be done by
multiplying by the multiplicative inverse of d (see Section 10-16,
“Exact Division by Constants,” on page 240). This method
would be particularly attractive if both the quotient and
remainder are wanted.
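As an illustration of the idea for a divisor other than 3, the sketch below combines the remainder code of Figure 10-37 with an exact division by 7. The name divu7_exact and its rem output parameter are hypothetical. The constant 0xB6DB6DB7 is the multiplicative inverse of 7 modulo 2^32 (7 * 0xB6DB6DB7 = 5 * 2^32 + 1). Like Figure 10-37, the code assumes a 32-bit int and an arithmetic right shift of negative values.

```c
/* n mod 7 by the multiplication method (the code of Figure 10-37). */
static unsigned remu7_mul(unsigned n) {
    n = (0x24924924u * n + (n >> 1) + (n >> 4)) >> 29;
    return n & (unsigned)((int)(n - 7) >> 31);     /* change 7 to 0 */
}

/* Sketch: quotient and remainder for unsigned division by 7.
   n - r is an exact multiple of 7, so multiplying it by the inverse
   of 7 modulo 2^32 yields the quotient. */
unsigned divu7_exact(unsigned n, unsigned *rem) {
    unsigned r = remu7_mul(n);
    if (rem) *rem = r;
    return (n - r) * 0xB6DB6DB7u;
}
```

For n = 100 the remainder is 2, and (100 - 2) * 0xB6DB6DB7 reduces modulo 2^32 to 14.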


Let us try this for the case of unsigned division by 3.
Computing the remainder by the multiplication method (Figure
10-34 on page 270) leads to the function shown in Figure 10-47.


unsigned divu3(unsigned n) {
   unsigned r;

   r = (0x55555555*n + (n >> 1) - (n >> 3)) >> 30;
   return (n - r)*0xAAAAAAAB;
}


FIGURE 10-47. Unsigned remainder and quotient with
divisor = 3, using exact division.


This is 11 instructions, including two multiplications by large
numbers. (The constant 0x55555555 can be generated by shifting
the constant 0xAAAAAAAB right one position.) In contrast, the
more straightforward method of computing the quotient q using
(for example) the code of Figure 10-8 on page 254, requires 14
instructions, including two multiplications by small numbers, or
17 elementary operations if the multiplications are expanded
into shift's and add's. If the remainder is also wanted, and it is
computed from r = n - q*3, the more straightforward method
requires 16 instructions, including three multiplications by small
numbers, or 20 elementary instructions if the multiplications are
expanded into shift's and add's.


The code of Figure 10-47 is not attractive if the
multiplications are expanded into shift's and add's; the result is
24 elementary instructions. Thus, the exact division method
might be a good one on a machine that does not have multiply
high but does have a fast modulo 2^32 multiply and slow divide,
particularly if it can easily deal with the large constants.


For signed division by 3, the exact division method might be
coded as shown in Figure 10-48. It is 15 instructions, including
two multiplications by large constants.


int divs3(int n) {
   unsigned r;

   r = n;
   r = (0x55555555*r + (r >> 1) - (r >> 3)) >> 30;
   r = r - (((unsigned)n >> 31) << (r & 2));
   return (n - r)*0xAAAAAAAB;
}


FIGURE 10-48. Signed remainder and quotient with divisor
= 3, using exact division.


As a final example, Figure 10-49 shows code for computing
the quotient and remainder for unsigned division by 10. It is 12
instructions, including two multiplications by large constants,
plus an indexed load instruction.


unsigned divu10(unsigned n) {
   unsigned r;
   static char table[16] = {0, 1, 2, 2, 3, 3, 4, 5,
                            5, 6, 7, 7, 8, 8, 9, 0};

   r = (0x19999999*n + (n >> 1) + (n >> 3)) >> 28;
   r = table[r];
   return ((n - r) >> 1)*0xCCCCCCCD;
}


FIGURE 10-49. Unsigned remainder and quotient with divisor
= 10, using exact division.


10-22 A Timing Test 


Many machines have a 32×32 ⇒ 64 multiply instruction, so one
would expect that to divide by a constant such as 3, the code
shown on page 228 would be fastest. If that multiply instruction
is not present, but the machine has a fast 32×32 ⇒ 32 multiply
instruction, then the exact division method might be a good one
if the machine has a slow divide and a fast multiply. To test this
conjecture, an assembly language program was constructed to
compare four methods of dividing by 3. The results are shown in
Table 10-4. The machine used was a 667 MHz Pentium III (ca.
2000), and one would expect similar results on many other
machines.


TABLE 10-4. UNSIGNED DIVIDE BY 3 ON A PENTIUM III


Division Method
Using machine's divide instruction (divl)
Using 32×32 ⇒ 64 multiply (code on page 228)
All elementary instructions (Figure 10-8 on page 254)
Convert to exact division (Figure 10-47 on page 274)


The first row gives the time in cycles for just two instructions:
an xorl to clear the left half of the 64-bit source register, and
the divl instruction, which evidently takes 40 cycles. The
second row also gives the time for just two instructions: multiply
and shift right 1 (mull and shrl). The third row gives the time
for a sequence of 21 elementary instructions. It is the code of
Figure 10-8 on page 254 using alternative 2, and with the
multiplication by 3 done with a single instruction (leal). Several
move instructions are necessary because the machine is
(basically) two-address. The last row gives the time for a
sequence of 10 instructions: two multiplications (imull) and the
rest elementary. The two imull instructions use 4-byte
immediate fields for the large constants. (The signed multiply
instruction imull is used rather than its unsigned counterpart
mull, because they give the same result in the low-order 32 bits,
and imull has more addressing modes available.)


The exact division method would be even more favorable
compared to the second and third methods if both the quotient
and remainder were wanted, because they would require
additional code for the computation r ← n - q*3. (The divl
instruction produces the remainder as well as the quotient.)


10-23 A Circuit for Dividing by 3 


There is a simple circuit for dividing by 3 that is about as
complex as an adder. It can be constructed very similarly to the
elementary way one constructs an n-bit adder from n 1-bit “full
adder” circuits. However, in the divider, signals flow from most
significant to least significant bit.


Consider dividing by 3 the way it is taught in grade school,
but in binary. To produce each bit of the quotient, you divide 3
into the next bit, but the bit is preceded by a remainder of 0, 1,
or 2 from the previous stage. The logic is shown in Table 10-5.
Here the remainder is represented by two bits r_i and s_i, with r_i
being the most significant bit. The remainder is never 3, so the
last two rows of the table represent “don't care” cases.


A circuit for 32-bit division by 3 is shown in Figure 10-50.
The quotient is the word consisting of bits y_31 through y_0, and
the remainder is 2r_0 + s_0.
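The behavior of such a circuit can be simulated in software, one cell per dividend bit, starting from the most significant bit. The sketch below is an illustration only: the name div3_circuit is hypothetical, and the Boolean expressions are one realization derived from the logic described above (using the "don't care" cases), not necessarily the exact gates of Figure 10-50.

```c
/* Sketch: bit-serial divide-by-3. The remainder passed between cells
   is 2r + s; the remainder entering the top cell is 0. */
unsigned div3_circuit(unsigned x, unsigned *rem) {
    unsigned r = 0, s = 0, q = 0;

    for (int i = 31; i >= 0; i--) {
        unsigned xi = (x >> i) & 1;         /* next dividend bit x_i      */
        unsigned nx = xi ^ 1;               /* its complement             */
        unsigned y  = r | (s & xi);         /* quotient bit: 2(2r+s)+x_i >= 3 */
        unsigned r2 = (r & xi) | (s & nx);  /* new remainder, high bit    */
        unsigned s2 = (r & nx) | ((r ^ 1) & (s ^ 1) & xi);  /* low bit    */
        q = (q << 1) | y;
        r = r2;
        s = s2;
    }
    if (rem) *rem = 2*r + s;                /* remainder is 2r_0 + s_0    */
    return q;
}
```

Each loop iteration corresponds to one cell of the hardware, which is why the signals must ripple from the high-order end to the low-order end.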

Another way to implement the divide-by-3 operation in 
hardware is to use the multiplier to multiply the dividend by the 
reciprocal of 3 (binary 0.010101...), with appropriate rounding 
and scaling. This is the technique shown on pages 207 and 228. 


TABLE 10-5. LOGIC FOR DIVIDING BY 3


r_{i+1}  s_{i+1}  x_i  |  y_i  r_i  s_i
   0        0      0   |   0    0    0
   0        0      1   |   0    0    1
   0        1      0   |   0    1    0
   0        1      1   |   1    0    0
   1        0      0   |   1    0    1
   1        0      1   |   1    1    0
   1        1      0   |   -    -    -
   1        1      1   |   -    -    -


FIGURE 10-50. Logic circuit for dividing by 3.


Exercises 


1. Show that for unsigned division by an even number, the 
shrxi instruction (or equivalent code) can be avoided by 
first (a) turning off the low-order bit of the dividend (and 
operation) [CavWer] or (b) dividing the dividend by 2 
(shift right 1 instruction) and then dividing by half the 
divisor. 


2. Code a function in Python similar to that of Figure 10-4 
on page 240, but for computing the magic number for 
signed division. Consider only positive divisors. 


3. Show how you would use Newton's method to calculate
the multiplicative inverse of an integer d modulo 81.
Show the calculations for d = 146.


I think that I shall never envision 
An op unlovely as division. 


An op whose answer must be guessed 
And then, through multiply, assessed; 


An op for which we dearly pay, 
In cycles wasted every day. 


Division code is often hairy; 
Long division's downright scary. 
The proofs can overtax your brain, 


The ceiling and floor may drive you insane. 


Good code to divide takes a Knuthian hero, 
But even God can't divide by zero! 


Chapter 11. Some Elementary Functions 


11-1 Integer Square Root 


By the "integer square root" function, we mean the function
⌊√x⌋. To extend its range of application and to avoid deciding what
to do with a negative argument, we assume x is unsigned. Thus,
0 ≤ x ≤ 2^32 - 1.


Newton's Method 


For floating-point numbers, the square root is almost universally
computed by Newton's method. This method begins by somehow
obtaining a starting estimate g_0 of √a. Then, a series of more
accurate estimates is obtained from


   g_{n+1} = (g_n + a/g_n) / 2.


The iteration converges quadratically; that is, if at some point
g_n is accurate to n bits, then g_{n+1} is accurate to 2n bits. The
program must have some means of knowing when it has iterated
enough so it can terminate.


It is a pleasant surprise that Newton's method works fine in 
the domain of integers. To see this, we need the following 
theorem: 


THEOREM. Let g_{n+1} = ⌊(g_n + ⌊a/g_n⌋)/2⌋, with g_n, a integers
greater than 0. Then


   (a) if g_n > ⌊√a⌋ then ⌊√a⌋ ≤ g_{n+1} < g_n, and
   (b) if g_n = ⌊√a⌋ then ⌊√a⌋ ≤ g_{n+1} ≤ ⌊√a⌋ + 1.


That is, if we have an integral guess g_n to ⌊√a⌋ that is too
high, then the next guess g_{n+1} will be strictly less than the
preceding one, but not less than ⌊√a⌋. Therefore, if we start with
a guess that's too high, the sequence converges monotonically. If
the guess g_n = ⌊√a⌋, then the next guess is either equal to g_n or
is 1 larger. This provides an easy way to determine when the
sequence has converged: If we start with g_0 ≥ ⌊√a⌋,
convergence has occurred when g_{n+1} ≥ g_n, and then the result
is precisely g_n.


The case a = 0 must be treated specially, because this
procedure would lead to dividing 0 by 0.
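The stopping rule implied by the theorem can be seen in a deliberately simple sketch. It uses the crude first guess g_0 = a, which is never below ⌊√a⌋ when a ≥ 1, so it takes more iterations than a carefully chosen starting value would. The name isqrt_newton_simple is hypothetical, and 64-bit intermediates are used only so that g_0 + a/g_0 cannot overflow with this large first guess.

```c
#include <stdint.h>

/* Sketch: integer Newton iteration with the convergence test g1 >= g0. */
unsigned isqrt_newton_simple(unsigned a) {
    if (a == 0) return 0;              /* avoid dividing 0 by 0            */

    uint64_t g0 = a;                   /* first guess, >= floor(sqrt(a))   */
    uint64_t g1 = (g0 + a / g0) >> 1;
    while (g1 < g0) {                  /* guesses strictly decrease ...    */
        g0 = g1;
        g1 = (g0 + a / g0) >> 1;
    }
    return (unsigned)g0;               /* ... until g0 = floor(sqrt(a))    */
}
```

For a = 15 the guesses are 15, 8, 4, 3, and then 4; because 4 is not less than 3, the loop stops and returns 3.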


Proof. (a) Because g_n is an integer,


   g_{n+1} = ⌊(g_n + ⌊a/g_n⌋)/2⌋ = ⌊⌊g_n + a/g_n⌋/2⌋
           = ⌊(g_n + a/g_n)/2⌋ = ⌊(g_n^2 + a)/(2g_n)⌋.


Because g_n > ⌊√a⌋ and g_n is an integer, g_n > √a. Define ε by
g_n = (1 + ε)√a. Then ε > 0 and


   g_{n+1} = ⌊(g_n^2 + a)/(2g_n)⌋ < (g_n^2 + g_n^2)/(2g_n) = g_n,


   g_{n+1} = ⌊((1 + ε)^2 a + a)/(2(1 + ε)√a)⌋
           = ⌊((2 + 2ε + ε^2)/(2(1 + ε)))√a⌋
           ≥ ⌊((2 + 2ε)/(2(1 + ε)))√a⌋ = ⌊√a⌋,


so that


   ⌊√a⌋ ≤ g_{n+1} < g_n.


(b) Because g_n = ⌊√a⌋, √a - 1 < g_n ≤ √a, so that
g_n^2 ≤ a < (g_n + 1)^2. Hence, we have


   ⌊(g_n^2 + g_n^2)/(2g_n)⌋ ≤ ⌊(g_n^2 + a)/(2g_n)⌋ ≤ ⌊(g_n^2 + (g_n + 1)^2)/(2g_n)⌋,


   ⌊g_n⌋ ≤ g_{n+1} ≤ ⌊g_n + 1 + 1/(2g_n)⌋,


   ⌊√a⌋ ≤ g_{n+1} ≤ g_n + 1   (because g_n is an integer and 1/(2g_n) < 1),


   ⌊√a⌋ ≤ g_{n+1} ≤ ⌊√a⌋ + 1.


The difficult part of using Newton's method to calculate ⌊√x⌋
is getting the first guess. The procedure of Figure 11-1 sets the
first guess g_0 equal to the least power of 2 that is greater than or
equal to √x. For example, for x = 4, g_0 = 2, and for x = 5, g_0 = 4.


int isqrt(unsigned x) {
   unsigned x1;
   int s, g0, g1;

   if (x <= 1) return x;
   s = 1;
   x1 = x - 1;
   if (x1 > 65535) {s = s + 8; x1 = x1 >> 16;}
   if (x1 > 255)   {s = s + 4; x1 = x1 >> 8;}
   if (x1 > 15)    {s = s + 2; x1 = x1 >> 4;}
   if (x1 > 3)     {s = s + 1;}

   g0 = 1 << s;                // g0 = 2**s.
   g1 = (g0 + (x >> s)) >> 1;  // g1 = (g0 + x/g0)/2.

   while (g1 < g0) {           // Do while approximations
      g0 = g1;                 // strictly decrease.
      g1 = (g0 + (x/g0)) >> 1;
   }
   return g0;
}


FIGURE 11-1. Integer square root, Newton’s method.


Because the first guess g_0 is a power of 2, it is not necessary
to do a real division to get g_1; instead, a shift right suffices.


Because the first guess is accurate to about one bit, and 
Newton’s method converges quadratically (the number of bits of 
accuracy doubles with each iteration), one would expect the 
procedure to converge within about five iterations (on a 32-bit 
machine), which requires four divisions (because the first 
iteration substitutes a shift right). An exhaustive experiment 
reveals that the maximum number of divisions is five, or four for 
arguments up to 16,785,407. 


If number of leading zeros is available, then getting the first
guess is very simple: Replace the first seven executable lines in
the procedure above with


   if (x <= 1) return x;
   s = 16 - nlz(x - 1)/2;
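For completeness, the whole function with this replacement might read as follows. It is a sketch: nlz here is a slow, portable stand-in written as a loop (a real implementation would use a machine instruction or a faster routine), the name isqrt_nlz is hypothetical, and a 32-bit unsigned int is assumed.

```c
/* Portable stand-in for a number-of-leading-zeros instruction. */
static int nlz(unsigned x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Sketch: Newton's method with the first guess derived from nlz. */
unsigned isqrt_nlz(unsigned x) {
    unsigned s, g0, g1;

    if (x <= 1) return x;
    s  = 16 - nlz(x - 1)/2;        /* 2**s = least power of 2 >= sqrt(x) */
    g0 = 1u << s;
    g1 = (g0 + (x >> s)) >> 1;     /* first step: the divide is a shift  */
    while (g1 < g0) {              /* do while approximations            */
        g0 = g1;                   /* strictly decrease                  */
        g1 = (g0 + x/g0) >> 1;
    }
    return g0;
}
```

For x = 5, nlz(4) = 29, so s = 16 - 14 = 2 and the first guess is 4, in agreement with the example given for Figure 11-1.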


Another alternative, if number of leading zeros is not available,
is to compute s by means of a binary search tree. This method
permits getting a slightly better value of g_0: the least power of 2
that is greater than or equal to ⌊√x⌋. For some values of x, this
gives a smaller value of g_0, but a value large enough so that the
convergence criterion of the theorem still holds. The difference
in these schemes is illustrated in the following table.


Range of x               Range of x                          First Guess
for Figure 11-1          for Figure 11-2                     g_0

0                        0                                   0
1                        1 to 3                              1
2 to 4                   4 to 8                              2
5 to 16                  9 to 24                             4
17 to 64                 25 to 80                            8
65 to 256                81 to 288                           16
...                      ...                                 ...
2^28 + 1 to 2^30         (2^14 + 1)^2 to (2^15 + 1)^2 - 1    2^15
2^30 + 1 to 2^32 - 1     (2^15 + 1)^2 to 2^32 - 1            2^16


This procedure is shown in Figure 11-2. It is convenient there
to treat small values of x (0 ≤ x ≤ 24) specially, so that no
divisions are done for them.


int isqrt(unsigned x) {
   int s, g0, g1;

   if (x <= 4224)
      if (x <= 24)
         if (x <= 3) return (x + 3) >> 2;
         else if (x <= 8) return 2;
         else return (x >> 4) + 3;
      else if (x <= 288)
         if (x <= 80) s = 3; else s = 4;
      else if (x <= 1088) s = 5; else s = 6;
   else if (x <= 1025*1025 - 1)
      if (x <= 257*257 - 1)
         if (x <= 129*129 - 1) s = 7; else s = 8;
      else if (x <= 513*513 - 1) s = 9; else s = 10;
   else if (x <= 4097*4097 - 1)
      if (x <= 2049*2049 - 1) s = 11; else s = 12;
   else if (x <= 16385*16385 - 1)
      if (x <= 8193*8193 - 1) s = 13; else s = 14;
   else if (x <= 32769*32769 - 1) s = 15; else s = 16;
   g0 = 1 << s;                // g0 = 2**s.

   // Continue as in Figure 11-1.


FIGURE 11-2. Integer square root, binary search for first
guess.


The worst-case execution time of the algorithm of Figure 11-1,
on the basic RISC, is about 26 + (D + 6)n cycles, where D is
the divide time in cycles and n is the number of times the
while-loop is executed. The worst-case execution time of Figure 11-2
is about 27 + (D + 6)n cycles, assuming (in both cases) that
the branch instructions take one cycle. The table that follows
gives the average number of times the loop is executed by the
two algorithms, for x uniformly distributed in the indicated
range.


x               Figure 11-1     Figure 11-2

0 to 9
0 to 99
0 to 999
0 to 9999


If we assume a divide time of 20 cycles and x ranging 
uniformly from 0 to 9999, then both algorithms execute in about 
81 cycles. 


Binary Search 


Because the algorithms based on Newton's method start out with
a sort of binary search to obtain the first guess, why not do the
whole computation with a binary search? This method would
start out with two bounds, perhaps initialized to 0 and $2^{16}$. It
would make a guess at the midpoint of the bounds. If the square
of the midpoint is greater than the argument x, then the upper
bound is changed to be equal to the midpoint. If the square of
the midpoint is less than or equal to the argument x, then the
lower bound is changed to be equal to the midpoint. The process
ends when the upper and lower bounds differ by 1, and the result
is the lower bound.


This avoids division, but requires quite a few multiplications:
16 if 0 and $2^{16}$ are used as the initial bounds. (The method
gets one more bit of precision with each iteration.) Figure 11-3
illustrates a variation of this procedure, which uses initial values
for the bounds that are slight improvements over 0 and $2^{16}$. The
procedure shown in Figure 11-3 also saves a cycle in the loop,
for most RISC machines, by altering a and b in such a way that
the comparison is $b \ge a$ rather than $b - a \ge 1$.


The predicates that must be maintained at the beginning of
each iteration are $a \le \lfloor\sqrt{x}\rfloor + 1$ and $b \ge \lfloor\sqrt{x}\rfloor$. The initial value
of b should be something that's easy to compute and close to
$\lfloor\sqrt{x}\rfloor$. Reasonable initial values are $x$, $x \div 4 + 1$, $x \div 8 + 2$,
$x \div 16 + 4$, $x \div 32 + 8$, $x \div 64 + 16$, and so on. Expressions
near the beginning of this list are better initial bounds for small
x, and those near the end are better for larger x. (The value
$x \div 2 + 1$ is acceptable, but probably not useful, because $x \div 4 + 1$
is everywhere a better or equal bound.)


int isqrt(unsigned x) {
   unsigned a, b, m;          // Limits and midpoint.

   a = 1;
   b = (x >> 5) + 8;          // See text.
   if (b > 65535) b = 65535;
   do {
      m = (a + b) >> 1;
      if (m*m > x) b = m - 1;
      else         a = m + 1;
   } while (b >= a);
   return a - 1;
}


FIGURE 11-3. Integer square root, simple binary search. 


Seven variations on the procedure shown in Figure 11-3 can
be more or less mechanically generated by substituting a + 1
for a, or b - 1 for b, or by changing m = (a + b) ÷ 2 to m = (a
+ b + 1) ÷ 2, or some combination of these substitutions.


The execution time of the procedure shown in Figure 11-3 is 
about 6 + (M + 7.5)n, where M is the multiplication time in 
cycles and n is the number of times the loop is executed. The 
following table gives the average number of times the loop is 
executed, for x uniformly distributed in the indicated range. 


                     Average Number
x                    of Loop Iterations

0 to 9
0 to 99
0 to 999
0 to 9999
0 to $2^{32} - 1$


If we assume a multiplication time of 5 cycles and x ranging
uniformly from 0 to 9999, the algorithm runs in about 94 cycles.
The maximum execution time (n = 16) is about 206 cycles.


If number of leading zeros is available, the initial bounds can
be set from

   b = (1 << (33 - nlz(x))/2) - 1;
   a = (b + 3)/2;

That is, $b = 2^{\lfloor (33 - \mathrm{nlz}(x)) \div 2 \rfloor} - 1$. These are very good bounds for
small values of x (one loop iteration for $0 \le x \le 15$), but only a
moderate improvement, for large x, over the bounds calculated
in Figure 11-3. For x in the range 0 to 9999, the average
number of iterations is about 5.45, which gives an execution
time of about 74 cycles, using the same assumptions as above.
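
As a rough illustration, the binary search of Figure 11-3 combined with these nlz-based bounds might be coded as in the sketch below. It assumes a 32-bit unsigned type. The helper nlz32 is a hypothetical loop-based stand-in for the number of leading zeros instruction, and the name isqrt_nlz is chosen here only for illustration.

```c
/* Hypothetical helper: counts leading zeros of a 32-bit value with a
   simple loop, standing in for a hardware nlz instruction. */
static int nlz32(unsigned x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Binary-search integer square root with nlz-based initial bounds.
   Assumes unsigned is 32 bits wide. */
unsigned isqrt_nlz(unsigned x) {
    unsigned a, b, m;                        /* Limits and midpoint. */

    b = (1u << ((33 - nlz32(x)) / 2)) - 1;   /* Upper bound. */
    a = (b + 3) / 2;                         /* Lower bound plus 1. */
    do {
        m = (a + b) >> 1;
        if (m * m > x) b = m - 1;
        else           a = m + 1;
    } while (b >= a);
    return a - 1;
}
```

Because b never exceeds 65535, the product m*m cannot overflow 32 bits.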


A Hardware Algorithm 


There is a shift-and-subtract algorithm for computing the square
root that is quite similar to the hardware division algorithm
described in Figure 9-2 on page 193. Embodied in hardware on
a 32-bit machine, this algorithm employs a 64-bit register that is
initialized to 32 0-bits followed by the argument x. On each
iteration, the 64-bit register is shifted left two positions, and the
current result y (initially 0) is shifted left one position. Then 2y
+ 1 is subtracted from the left half of the 64-bit register. If the
result of the subtraction is nonnegative, it replaces the left half
of the 64-bit register, and 1 is added to y (this does not require
an adder, because y ends in 0 at this point). If the result of the
subtraction is negative, then the 64-bit register and y are left
unaltered. The iteration is done 16 times.


This algorithm was described in 1945 [JVN]. 


Perhaps surprisingly, this process runs in about half the time
of that of the 64 ÷ 32 ⇒ 32 hardware division algorithm cited,
because it does half as many iterations and each iteration is
about equally complex in the two algorithms.


To code this algorithm in software, it is probably best to
avoid the use of a doubleword shift register, which requires
about four instructions to shift. The algorithm in Figure 11-4
[GLS1] accomplishes this by shifting y and a mask bit m to the
right. It executes in about 149 basic RISC instructions (average).
The two expressions y | m could also be y + m.

The operation of this algorithm is similar to the grade-school
method. It is illustrated here, for finding $\lfloor\sqrt{179}\rfloor$ on an 8-bit
machine.


     1011 0011  x0    Initially, x = 179 (0xB3).
   -  1         b1
     ---------
     0111 0011  x1    0100 0000  y1
   -  101       b2    0010 0000  y2
     ---------
     0010 0011  x2    0011 0000  y2
   -   11 01    b3    0001 1000  y3
     ---------
     0010 0011  x3    0001 1000  y3   (Can't subtract.)
   -    1 1001  b4    0000 1100  y4
     ---------
     0000 1010  x4    0000 1101  y4


The result is 13 with a remainder of 10 left in register x. 


int isqrt(unsigned x) {
   unsigned m, y, b;

   m = 0x40000000;
   y = 0;
   while(m != 0) {              // Do 16 times.
      b = y | m;
      y = y >> 1;
      if (x >= b) {
         x = x - b;
         y = y | m;
      }
      m = m >> 2;
   }
   return y;
}


FIGURE 11-4. Integer square root, hardware algorithm. 


It is possible to eliminate the if x >= b test by the usual
trickery involving shift right signed 31. It can be proved that the
high-order bit of b is always zero (in fact, $b \le 5 \cdot 2^{28}$), which
simplifies the x >= b predicate (see page 23). The result is that
the if statement group can be replaced with

   t = (int)(x | ~(x - b)) >> 31;   // -1 if x >= b, else 0.
   x = x - (b & t);
   y = y | (m & t);

This replaces an average of three cycles with seven, assuming
the machine has the or not instruction, but it might be worthwhile
if a conditional branch in this context takes more than five cycles.
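
Putting the replacement into the loop of Figure 11-4 gives a routine along the following lines. This is only a sketch: it assumes that unsigned int is 32 bits wide (so that uint32_t arithmetic is not promoted), the name isqrt_branchfree is chosen here for illustration, and the mask t is formed with an unsigned shift and a negation rather than a signed shift, so that the C code does not depend on how the compiler shifts negative values.

```c
#include <stdint.h>

/* Shift-and-subtract square root with the if statement replaced by
   masking. Assumes 32-bit unsigned int. */
uint32_t isqrt_branchfree(uint32_t x) {
    uint32_t m = 0x40000000u, y = 0, b, t;

    while (m != 0) {                           /* Do 16 times. */
        b = y | m;
        y = y >> 1;
        /* High bit of (x | ~(x - b)) is 1 exactly when x >= b, given
           that the high bit of b is 0. Spread it to a full mask. */
        t = (uint32_t)0 - ((x | ~(x - b)) >> 31);
        x = x - (b & t);                       /* Subtract only if x >= b. */
        y = y | (m & t);                       /* Set result bit likewise. */
        m = m >> 2;
    }
    return y;
}
```

The loop itself still branches on m; only the data-dependent test inside it is removed.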


Somehow it seems that it should be easier than some hundred 
cycles to compute an integer square root in software. Toward 
this end, we offer the expressions that follow to compute it for 


very small values of the argument. These can be useful to speed 
up some of the algorithms given above, if the argument is 
expected to be small. 


                                 is correct     and uses this
The expression                   in the         many instructions
                                 range          (full RISC).

x > 0
(x + 3) ÷ 4
x >> (x ÷ 2)
x >> (x > 1)
(x + 12) ÷ 8
(x + 15) ÷ 8
(x > 0) + (x > 3)
(x > 0) + (x > 3) + (x > 8)
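
The range over which such an expression is valid is easy to check by brute force. The sketch below does this for a few of the expressions. All function names are hypothetical, the reference routine is a plain linear search suitable only for small arguments, and the ranges mentioned afterward were obtained by working through the expressions by hand rather than taken from the table.

```c
/* Reference value: largest r with r*r <= x (fine for small x). */
unsigned isqrt_ref(unsigned x) {
    unsigned r = 0;
    while ((r + 1) * (r + 1) <= x) r++;
    return r;
}

/* A few of the small-argument expressions, written as C functions. */
unsigned expr_gt0(unsigned x)   { return x > 0; }
unsigned expr_add3(unsigned x)  { return (x + 3) / 4; }
unsigned expr_add12(unsigned x) { return (x + 12) / 8; }
unsigned expr_add15(unsigned x) { return (x + 15) / 8; }
unsigned expr_sum3(unsigned x)  { return (x > 0) + (x > 3) + (x > 8); }

/* Returns 1 if f(x) equals the integer square root for lo <= x <= hi. */
int agrees(unsigned (*f)(unsigned), unsigned lo, unsigned hi) {
    unsigned x;
    for (x = lo; x <= hi; x++)
        if (f(x) != isqrt_ref(x)) return 0;
    return 1;
}
```

For example, agrees(expr_add12, 1, 8) and agrees(expr_add15, 4, 15) both hold, while expr_add12 already fails at x = 0 and at x = 9.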


Ah, the elusive square root, 

It should be a cinch to compute. 
But the best we can do 
Is use powers of two 

And iterate the method of Newt! 


11-2 Integer Cube Root 


For cube roots, Newton's method does not work out very well.
The iterative formula is a bit complex:

$$x_{n+1} = \frac{1}{3}\left(2x_n + \frac{a}{x_n^2}\right),$$

and there is of course the problem of getting a good starting
value $x_0$.

However, there is a hardware algorithm, similar to the
hardware algorithm for square root, that is not too bad for
software. It is shown in Figure 11-5.


The three add's of 1 can be replaced by or's of 1, because the 
value being incremented is even. Even with this change, the 
algorithm is of questionable value for implementation in 
hardware, mainly because of the multiplication y * (y + 1). 


This multiplication is easily avoided by applying the compiler
optimization of strength reduction to the y-squared term.
Introduce another unsigned variable y2 that will have the value
of y-squared, by updating y2 appropriately wherever y receives
a new value. Just before y = 0 insert y2 = 0. Just before y =
2*y insert y2 = 4*y2. Change the assignment to b to b = (3*y2
+ 3*y + 1) << s (and factor out the 3). Just before y = y + 1,
insert y2 = y2 + 2*y + 1. The resulting program has no
multiplications except by small constants, which can be changed
to shift's and add's. This program has three add's of 1, which can
all be changed to or's of 1. It is faster unless your machine's
multiply instruction takes only two or fewer cycles.


int icbrt(unsigned x) {
   int s;
   unsigned y, b;

   y = 0;
   for (s = 30; s >= 0; s = s - 3) {
      y = 2*y;
      b = (3*y*(y + 1) + 1) << s;
      if (x >= b) {
         x = x - b;
         y = y + 1;
      }
   }
   return y;
}


FIGURE 11-5. Integer cube root, hardware algorithm. 
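
The strength-reduced variant described in the text might look like the sketch below. It follows the listed substitutions literally; the function name icbrt_sr is chosen here for illustration, and unsigned is assumed to be 32 bits wide.

```c
/* Strength-reduced integer cube root. y2 always holds y*y, so the
   assignment to b needs no general multiplication (the multiplications
   by 2, 3, and 4 can become shifts and adds). Assumes 32-bit unsigned. */
unsigned icbrt_sr(unsigned x) {
    int s;
    unsigned y, y2, b;

    y = 0;
    y2 = 0;
    for (s = 30; s >= 0; s = s - 3) {
        y2 = 4*y2;                     /* Keep y2 == y*y as y doubles. */
        y = 2*y;
        b = (3*(y2 + y) + 1) << s;     /* (3*y*y + 3*y + 1) << s. */
        if (x >= b) {
            x = x - b;
            y2 = y2 + 2*y + 1;         /* (y + 1)^2 = y^2 + 2y + 1. */
            y = y + 1;
        }
    }
    return y;
}
```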


Caution: [GLS1] points out that the code of Figure 11-5, and
its strength-reduced derivative, do not work if adapted in the
obvious way to a 64-bit machine. The assignment to b can then
overflow. This problem can be avoided by dropping the shift left
of s from the assignment to b, inserting after the assignment to
b the assignment bs = b << s, and changing the two lines if (x
>= b) {x = x - b ... to if (x >= bs && b == (bs >> s)) {x
= x - bs ....


11-3 Integer Exponentiation
Computing $x^n$ by Binary Decomposition of n


A well-known technique for computing $x^n$, when n is a
nonnegative integer, involves the binary representation of n. The
technique applies to the evaluation of an expression of the form
$x \cdot x \cdot x \cdots x$ where $\cdot$ is any associative operation, such as
addition, multiplication including matrix multiplication, and
string concatenation (as suggested by the notation $(\text{'ab'})^3$ =
'ababab'). As an example, suppose we wish to compute $y = x^{13}$.
Because 13 expressed in binary is 1101 (that is, 13 = 8 + 4 +
1),

$$x^{13} = x^{8+4+1} = x^8 \cdot x^4 \cdot x^1.$$

Thus, $x^{13}$ can be computed as follows:

$$t_1 \leftarrow x^2$$
$$t_2 \leftarrow t_1^2$$
$$t_3 \leftarrow t_2^2$$
$$y \leftarrow t_3 \cdot t_2 \cdot x$$

This requires five multiplications, considerably fewer than the
12 that would be required by repeated multiplication by x.


If the exponent is a variable, known to be a nonnegative 
integer, the technique can be employed in a subroutine, as 
shown in Figure 11-6. 

The number of multiplications done by this method is, for
exponent $n \ge 1$,

$$\lfloor \log_2 n \rfloor + \mathrm{nbits}(n) - 1.$$


This is not always the minimal number of multiplications. For
example, for n = 27, the binary decomposition method
computes

$$x^{16} \cdot x^8 \cdot x^2 \cdot x^1,$$

which requires seven multiplications. However, the scheme
illustrated by

$$((x^3)^3)^3$$

requires only six. The smallest number for which the binary
decomposition method is not optimal is n = 15 (Hint:
$x^{15} = (x^3)^5$).

Perhaps surprisingly, there is no known simple method that,
for all n, finds an optimal sequence of multiplications to
compute $x^n$. The only known methods involve an extensive
search. The problem is discussed at some length in [Knu2,
4.6.3].


The binary decomposition method has a variant that scans
the binary representation of the exponent in left-to-right order
[Rib, 32], which is analogous to the left-to-right method of
converting binary to decimal. Initialize the result y to 1, and
scan the exponent from left to right. When a 0 is encountered,
square y. When a 1 is encountered, square y and multiply it by x.
This computes $x^{13} = x^{1101_2}$ as

$$(((1^2 \cdot x)^2 \cdot x)^2)^2 \cdot x.$$


int iexp(int x, unsigned n) {
   int p, y;

   y = 1;                        // Initialize result
   p = x;                        // and p.
   while(1) {
      if (n & 1) y = p*y;        // If n is odd, mult by p.
      n = n >> 1;                // Position next bit of n.
      if (n == 0) return y;      // If no more bits in n.
      p = p*p;                   // Power for next bit of n.
   }
}


FIGURE 11-6. Computing $x^n$ by binary decomposition of n.


The left-to-right variant always requires the same number of
(nontrivial) multiplications as the right-to-left method of
Figure 11-6.
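
A left-to-right scan of the exponent could be coded as in the following sketch. The name iexp_l2r is chosen here for illustration; the arithmetic is done on unsigned values so that overflow wraps around, and a 32-bit exponent is assumed. For simplicity the loop also squares y for the leading 0-bits of n, which are trivial multiplications of 1 by 1.

```c
/* Left-to-right binary exponentiation. Assumes 32-bit unsigned. */
unsigned iexp_l2r(unsigned x, unsigned n) {
    unsigned y = 1;                  /* Initialize result. */
    int i;

    for (i = 31; i >= 0; i--) {      /* Scan n from the high-order bit. */
        y = y*y;                     /* Square for every bit of n. */
        if ((n >> i) & 1) y = y*x;   /* Multiply by x when the bit is 1. */
    }
    return y;
}
```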


$2^n$ in Fortran


The IBM XL Fortran compiler takes the definition of this function
to be

$$\mathrm{pow2}(n) = \begin{cases} 2^n, & 0 \le n \le 30, \\ -2^{31}, & n = 31, \\ 0, & n < 0 \text{ or } n \ge 32. \end{cases}$$

It is assumed that n and the result are interpreted as signed
integers. The ANSI/ISO Fortran standard requires that the result
be 0 if n < 0. The definition above for $n \ge 31$ seems reasonable
in that it is the correct result modulo $2^{32}$, and it agrees with
what repeated multiplication would give.


The standard way to compute $2^n$ is to put the integer 1 in a
register and shift it left n places. This does not satisfy the Fortran
definition, because shift amounts are usually treated modulo 64
or modulo 32 (on a 32-bit machine), which gives incorrect
results for large or negative shift amounts.


If your machine has number of leading zeros, pow2(n) can be
computed in four instructions as follows [Shep]:

   x ← nlz(n >> 5);     // x ← 32 if 0 ≤ n ≤ 31, x < 32 otherwise.
   x ← x >> 5;          // x ← 1 if 0 ≤ n ≤ 31, 0 otherwise.
   pow2 ← x << n;


The shift right operations are “logical” (not sign-propagating), 
even though n is a signed quantity. 


If the machine does not have the nlz instruction, its use above
can be replaced with one of the x = 0 tests given in
"Comparison Predicates" on page 23, changing the expression
x >> 5 to x >> 31. A possibly better method is to realize that the
predicate $0 \le x \le 31$ is equivalent to the unsigned comparison
x < 32, and then simplify the expression for the unsigned x < y
predicate given in the cited section; it becomes ¬x & (x - 32).
This gives a solution in five instructions (four if the machine
has and not):

   x ← ¬n & (n - 32);   // x < 0 iff 0 ≤ n ≤ 31.
   x ← x >> 31;         // x = 1 if 0 ≤ n ≤ 31, 0 otherwise.
   pow2 ← x << n;
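
In C, both sequences might be rendered as in the sketch below. The names pow2_nlz, pow2_andnot, and nlz32 are hypothetical, and nlz32 is a loop-based stand-in for the hardware instruction. Because C leaves a shift by 32 or more undefined, the final shift amount is masked with 31; this does not change the result, since x is 0 whenever n is outside the range 0 to 31. The conversion of $2^{31}$ to int32_t is assumed to wrap to $-2^{31}$, as it does on two's-complement implementations.

```c
#include <stdint.h>

/* Hypothetical stand-in for the nlz instruction. */
static int nlz32(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* pow2 using number of leading zeros. */
int32_t pow2_nlz(int32_t n) {
    uint32_t x = (uint32_t)nlz32((uint32_t)n >> 5);  /* 32 iff 0 <= n <= 31. */
    x = x >> 5;                                      /* 1 iff 0 <= n <= 31.  */
    return (int32_t)(x << ((uint32_t)n & 31));
}

/* pow2 using the not-and-subtract form of the range test. */
int32_t pow2_andnot(int32_t n) {
    uint32_t u = (uint32_t)n;
    uint32_t x = ~u & (u - 32);      /* High bit set iff 0 <= n <= 31. */
    x = x >> 31;                     /* Logical shift: 1 or 0.         */
    return (int32_t)(x << (u & 31));
}
```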

11-4 Integer Logarithm 


By the “integer logarithm” function we mean the function
$\lfloor \log_b x \rfloor$, where x is a positive integer and b is an integer greater than
or equal to 2. Usually, b = 2 or 10, and we denote these
functions by “ilog2” and “ilog10,” respectively. We use “ilog”
when the base is unspecified.


It is convenient to extend the definition to x = 0 by defining 
ilog(0) = -1 [CJS]. There are several reasons for this definition: 


• The function ilog2(x) is then related very simply to the
number of leading zeros function, nlz(x), by the formula
shown below, including the case x = 0. Thus, if one of
these functions is implemented in hardware or software,
the other is easily obtained.

ilog2(x) = 31 - nlz(x)

• It is easy to compute $\lceil \log(x) \rceil$ using the formula below.
For x = 1, this formula implies that ilog(0) = -1.

$\lceil \log(x) \rceil$ = ilog(x - 1) + 1

• It makes the following identity hold for x = 1 (but it
doesn't hold for x = 0).

ilog2(x ÷ 2) = ilog2(x) - 1

• It makes the result of ilog(x) a small dense set of integers
(-1 to 31 for ilog2(x) on a 32-bit machine, with x
unsigned), making it directly useful for indexing a table.

• It falls naturally out of several algorithms for computing
ilog2(x) and ilog10(x).

Unfortunately, it isn't the right definition for “number of
digits of x,” which is ilog(x) + 1 for all x except x = 0. It seems
best to consider that anomalous.


For x < 0, ilog(x) is left undefined. To extend its range of 
utility, we define the function as mapping unsigned numbers to 
signed numbers. Thus, a negative argument cannot occur. 


Integer Log Base 2 


Computing ilog2(x) is essentially the same as computing the
number of leading zeros, which is discussed in “Counting
Leading 0’s” on page 99. All the algorithms in that section can
be easily modified to compute ilog2(x) directly, rather than by
computing nlz(x) and subtracting the result from 31. (For the
algorithm of Figure 5-16 on page 102, change the line return
pop(~x) to return pop(x) - 1.)
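
For reference, the relation between ilog2 and nlz, together with the ceiling formula given earlier, can be sketched as follows. The names ilog2_u32, ceil_log2_u32, and nlz32 are hypothetical, nlz32 is a simple loop standing in for the hardware instruction, and a 32-bit unsigned type is assumed.

```c
/* Hypothetical stand-in for the nlz instruction. */
static int nlz32(unsigned x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Floor of log base 2, with ilog2(0) = -1. */
int ilog2_u32(unsigned x) {
    return 31 - nlz32(x);
}

/* Ceiling of log base 2, valid for x >= 1. */
int ceil_log2_u32(unsigned x) {
    return ilog2_u32(x - 1) + 1;
}
```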


Integer Log Base 10 


This function has application in converting a number to decimal 
for inclusion into a line with leading zeros suppressed. The 
conversion process successively divides by 10, producing the 
least significant digit first. It would be useful to know ahead of 
time where the least significant digit should be placed, to avoid 
putting the converted number in a temporary area and then 
moving it. 

To compute ilog10(x), a table search is quite reasonable. This 
could be a binary search, but because the table is small and in 
many applications x is usually small, a simple linear search is 
probably best. This rather straightforward program is shown in 
Figure 11-7. 


On the basic RISC, this program can be implemented to
execute in about $9 + 4\lfloor \log_{10} x \rfloor$ instructions. Thus, it executes
in five to 45 instructions, with perhaps 13 (for $10 \le x \le 99$)
being typical.

The program in Figure 11-7 can easily be changed into an “in 
register" version (not using a table). The executable part of such 
a program is shown in Figure 11-8. This might be useful if the 
machine has a fast way to multiply by 10. 


int ilog10(unsigned x) {
   int i;
   static unsigned table[11] = {0, 9, 99, 999, 9999,
      99999, 999999, 9999999, 99999999, 999999999,
      0xFFFFFFFF};

   for (i = -1; ; i++) {
      if (x <= table[i+1]) return i;
   }
}


FIGURE 11-7. Integer log base 10, simple table search. 


   p = 1;
   for (i = -1; i <= 8; i++) {
      if (x < p) return i;
      p = 10*p;
   }
   return i;


FIGURE 11-8. Integer log base 10, repeated multiplication by 
10. 


This program can be implemented to execute in about
$10 + 6\lfloor \log_{10} x \rfloor$ instructions on the basic RISC (counting the multiply
as one instruction). This amounts to 16 instructions for
$10 \le x \le 99$.

A binary search can be used, giving an algorithm that is loop-free
and does not use a table. Such an algorithm might compare
x to $10^4$, then to either $10^2$ or to $10^6$, and so on, until the
exponent n is found such that $10^n \le x < 10^{n+1}$. The paths
execute in ten to 18 instructions, four or five of which are
branches (counting the final unconditional branch).
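
One possible shape for such a loop-free binary search is sketched below. The name ilog10_bsearch and the exact order of the comparisons are choices made here for illustration; no claim is made that this ordering reproduces the instruction counts quoted above.

```c
/* Loop-free, table-free integer log base 10 by binary search on the
   powers of 10. Returns -1 for x = 0. Assumes 32-bit unsigned. */
int ilog10_bsearch(unsigned x) {
    if (x >= 10000u) {                            /* 10^4 */
        if (x >= 1000000u) {                      /* 10^6 */
            if (x >= 100000000u)                  /* 10^8 */
                return (x >= 1000000000u) ? 9 : 8;
            return (x >= 10000000u) ? 7 : 6;
        }
        return (x >= 100000u) ? 5 : 4;
    }
    if (x >= 100u)                                /* 10^2 */
        return (x >= 1000u) ? 3 : 2;
    if (x >= 10u)
        return 1;
    return (x >= 1u) ? 0 : -1;
}
```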


The program shown in Figure 11-9 is a modification of the
binary search that has a maximum of four branches on any path
and is written in a way that favors small x. It executes in six
basic RISC instructions for $10 \le x \le 99$, and in 11 to 16
instructions for $x \ge 100$.


The shift instructions in this program are signed shifts (which
is the reason for the (int) casts). If your machine does not have
this instruction, one of the alternatives below, which use
unsigned shifts, may be preferable. These are illustrated for the
case of the first return statement. Unfortunately, the first two
require subtract from immediate for efficient implementation,
which most machines don't have. The last involves adding a
large constant (two instructions), but this does not matter for the
second and third return statements, which require adding a
large constant anyway. The large constant is $2^{31} - 1000$.

   return 3 - ((x - 1000) >> 31);
   return 2 + ((999 - x) >> 31);
   return 2 + ((x + 2147482648) >> 31);

An alternative for the fourth return statement is

   return 8 + (((x + 1147483648) | x) >> 31);

where the large constant is $2^{31} - 10^9$. This avoids both the and
not and the signed shift.

Alternatives for the last if-else construction are

   return ((int)(x - 1) >> 31) | ((unsigned)(9 - x) >> 31);
   return (x > 9) + (x > 0) - 1;

either of which saves a branch.


int ilog10(unsigned x) {
   if (x > 99)
      if (x < 1000000)
         if (x < 10000)
            return 3 + ((int)(x - 1000) >> 31);
         else
            return 5 + ((int)(x - 100000) >> 31);
      else
         if (x < 100000000)
            return 7 + ((int)(x - 10000000) >> 31);
         else
            return 9 + ((int)((x-1000000000)&~x) >> 31);
   else
      if (x > 9) return 1;
      else       return ((int)(x - 1) >> 31);
}


FIGURE 11-9. Integer log base 10, modified binary search. 


If nlz(x) or ilog2(x) is available as an instruction, there are
better and more interesting ways to compute ilog10(x). For
example, the program in Figure 11-10 does it in two table
lookups [CJS].

From table1 an approximation to ilog10(x) is obtained. The
approximation is usually the correct value, but it is too high by 1
for x = 0 and for x in the range 8 to 9, 64 to 99, 512 to 999,
8192 to 9999, and so on. The second table gives the value below
which the estimate must be corrected by subtracting 1.

This scheme uses a total of 73 bytes for tables and can be
coded in only six instructions on the IBM System/370 [CJS] (to
achieve this, the values in table1 must be four times the values
shown). It executes in about ten instructions on a RISC that has
number of leading zeros, but no other uncommon instructions.
The other methods to be discussed are variants of this.


The first variation eliminates the conditional branch that 
results from the if statement. Actually, the program in Figure 
11-10 can be coded free of branches if the machine has the set 
less than unsigned instruction, but the method to be described can 
be used on machines that have no unusual instructions (other 
than number of leading zeros). 


The method is to replace the if statement with a subtraction
followed by a shift right of 31, so that the sign bit can be
subtracted from y. A difficulty occurs for large x ($x \ge 2^{31} + 10^9$),
which can be fixed by adding an entry to table2, as shown
in Figure 11-11.


This executes in about 11 instructions on a RISC that has 
number of leading zeros but is otherwise quite “basic.” It can be 
modified to return the value 0, rather than -1, for x = 0 (which 
is preferable for the decimal conversion problem) by changing 
the last entry in table1 to 1 (that is, by changing “0, 0, 0, 0” to 
“0, 0, 0, 1”). 


int ilog10(unsigned x) {
   int y;
   static unsigned char table1[33] = {9, 9, 9, 8, 8, 8,
      7, 7, 7, 6, 6, 6, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3,
      2, 2, 2, 1, 1, 1, 0, 0, 0, 0};
   static unsigned table2[10] = {1, 10, 100, 1000, 10000,
      100000, 1000000, 10000000, 100000000, 1000000000};

   y = table1[nlz(x)];
   if (x < table2[y]) y = y - 1;
   return y;
}


FIGURE 11-10. Integer log base 10 from log base 2, double 
table lookup. 


The next variation replaces the first table lookup with a
subtraction, a multiplication, and a shift. This seems likely to be
possible because $\log_{10} x$ and $\log_2 x$ are related by a multiplicative
constant, namely $\log_{10} 2 = 0.30103\ldots$. Thus, it may be possible
to compute ilog10(x) by computing $\lfloor c\,\mathrm{ilog2}(x) \rfloor$ for some
suitable $c \approx 0.30103$, and correcting the result by using a table
such as table2 in Figure 11-11.


int ilog10(unsigned x) {
   int y;
   static unsigned char table1[33] = {10, 9, 9, 8, 8, 8,
      7, 7, 7, 6, 6, 6, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3,
      2, 2, 2, 1, 1, 1, 0, 0, 0, 0};
   static unsigned table2[11] = {1, 10, 100, 1000, 10000,
      100000, 1000000, 10000000, 100000000, 1000000000,
      0};

   y = table1[nlz(x)];
   y = y - ((x - table2[y]) >> 31);
   return y;
}


FIGURE 11-11. Integer log base 10 from log base 2, double 
table lookup, branch free. 


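To pursue this, let $\log_{10} 2 = c + \varepsilon$, where c > 0 is a rational
approximation to $\log_{10} 2$ that is a convenient multiplier, and
$\varepsilon > 0$. Then, for $x \ge 1$,

$$\begin{aligned}
\mathrm{ilog10}(x) &= \lfloor \log_{10} x \rfloor = \lfloor (c + \varepsilon)\log_2 x \rfloor, \\
\lfloor c \log_2 x \rfloor &\le \mathrm{ilog10}(x) = \lfloor c \log_2 x + \varepsilon \log_2 x \rfloor, \\
\lfloor c\,\mathrm{ilog2}(x) \rfloor &\le \mathrm{ilog10}(x) \le \lfloor c\,(\mathrm{ilog2}(x) + 1) + \varepsilon \log_2 x \rfloor \\
&\le \lfloor c\,\mathrm{ilog2}(x) + c + \varepsilon \log_2 x \rfloor \\
&\le \lfloor c\,\mathrm{ilog2}(x) \rfloor + \lfloor c + \varepsilon \log_2 x \rfloor + 1.
\end{aligned}$$

Thus, if we choose c so that $c + \varepsilon \log_2 x < 1$, then $\lfloor c\,\mathrm{ilog2}(x) \rfloor$
approximates ilog10(x) with an error of 0 or +1. Furthermore,
if we take ilog2(0) = ilog10(0) = -1, then $\lfloor c\,\mathrm{ilog2}(0) \rfloor$ =
ilog10(0) (because $0 < c \le 1$), so we need not be concerned
about this case. (There are other definitions that would work
here, such as ilog2(0) = ilog10(0) = 0.)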


Because $\varepsilon = \log_{10} 2 - c$, we must choose c so that
$c + (\log_{10} 2 - c)\log_2 x < 1$, or

$$c\,(\log_2 x - 1) > (\log_{10} 2)\log_2 x - 1.$$

This is satisfied for x = 1 (because c < 1) and 2. For larger x,
we must have

$$c > \frac{(\log_{10} 2)\log_2 x - 1}{\log_2 x - 1}.$$

The most stringent requirement on c occurs when x is large. For
a 32-bit machine, $x < 2^{32}$, so choosing

$$c \ge \frac{0.30103 \cdot 32 - 1}{32 - 1} \approx 0.27848$$

suffices. Because c < 0.30103 (because $\varepsilon > 0$), c = 9/32 =
0.28125 is a convenient value. Experimentation reveals that
coarser values such as 5/16 and 1/4 are not adequate.


This leads to the scheme illustrated in Figure 11-12, which 
estimates low and then corrects by adding 1. It executes in about 
11 instructions on a RISC that has number of leading zeros, 
counting the multiply as one instruction. 


This can be made into a branch-free version, but again there
is a difficulty with large x ($x \ge 2^{31} + 10^9$), which can be fixed
in either of two ways. One way is to use a different multiplier
(19/64) and a slightly expanded table. The program is shown in
Figure 11-13 (about 11 instructions on a RISC that has number
of leading zeros, counting the multiply as one instruction).

The other “fix” is to or x into the result of the subtraction to
force the sign bit to be on for $x \ge 2^{31}$; that is, change the
second executable line of Figure 11-12 to

   y = y + (((table2[y+1] - x) | x) >> 31);

This is the preferable program if multiplication by 19 is
substantially more difficult than multiplication by 9 (as it is for a
shift-and-add sequence).


int ilog10(unsigned x) {
   int y;
   static unsigned table2[10] = {0, 9, 99, 999, 9999,
      99999, 999999, 9999999, 99999999, 999999999};

   y = (9*(31 - nlz(x))) >> 5;
   if (x > table2[y+1]) y = y + 1;
   return y;
}


FIGURE 11-12. Integer log base 10 from log base 2, one table 
lookup. 


int ilog10(unsigned x) {
   int y;
   static unsigned table2[11] = {0, 9, 99, 999, 9999,
      99999, 999999, 9999999, 99999999, 999999999,
      0xFFFFFFFF};

   y = (19*(31 - nlz(x))) >> 6;
   y = y + ((table2[y + 1] - x) >> 31);
   return y;
}


FIGURE 11-13. Integer log base 10 from log base 2, one table 
lookup, branch free. 


For a 64-bit machine, choosing

$$c \ge \frac{0.30103 \cdot 64 - 1}{64 - 1} \approx 0.28993$$

suffices. The value 19/64 = 0.296875 is convenient, and
experimentation reveals that no coarser value is adequate. The
program is (branch-free version)

   unsigned table2[20] = {0, 9, 99, 999, 9999, ...,
      9999999999999999999};

   y = (19*(63 - nlz(x))) >> 6;
   y = y + ((table2[y + 1] - x) >> 63);
   return y;


Exercises


1. Is the correct integer fourth root of an integer x obtained
by computing the integer square root of the integer
square root of x? That is, does
$\lfloor \sqrt{\lfloor \sqrt{x} \rfloor} \rfloor = \lfloor \sqrt[4]{x} \rfloor$?

2. Code the 64-bit version of the cube root routine that is
mentioned at the end of Section 11-2. Use the “long
long” C data type. Do you see an alternative method for
handling the overflow of b that probably results in a
faster routine?

3. How many multiplications does it take to compute $x^{23}$
(modulo $2^W$, where W is the computer's word size)?

4. Describe in simple terms the functions (a) $2^{\mathrm{ilog2}(x)}$ and (b)
$2^{\mathrm{ilog2}(x - 1) + 1}$ for x an integer greater than 0.


Chapter 12. Unusual Bases for Number 
Systems 


This section discusses a few unusual positional number 
systems. They are just interesting curiosities and are probably 
not practical for anything. We limit the discussion to integers, 
but they can all be extended to include digits after the radix 
point—which usually, but not always, denotes non-integers. 


12-1 Base -2 


By using -2 as the base, both positive and negative integers can
be expressed without an explicit sign or other irregularity, such
as having a negative weight for the most significant bit [Knu3].
The digits used are 0 and 1, as in base +2; that is, the value
represented by a string of 1’s and 0’s is understood to be

$$(a_n \ldots a_3 a_2 a_1 a_0) = a_n(-2)^n + \cdots + a_3(-2)^3 + a_2(-2)^2 + a_1(-2) + a_0.$$

From this, it can be seen that a procedure for finding the base
-2, or “negabinary,” representation of an integer is to
successively divide the number by -2, recording the
remainders. The division must be such that it always gives a
remainder of 0 or 1 (the digits to be used); that is, it must be
modulus division. As an example, the plan below shows how to
find the base -2 representation of -3.

$$\begin{aligned}
\frac{-3}{-2} &= 2 \text{ rem } 1 \\
\frac{2}{-2} &= -1 \text{ rem } 0 \\
\frac{-1}{-2} &= 1 \text{ rem } 1 \\
\frac{1}{-2} &= 0 \text{ rem } 1
\end{aligned}$$


Because we have reached a 0 quotient, the process terminates (if 
continued, the remaining quotients and remainders would all be 
0). Thus, reading the remainders upward, we see that -3 is 
written 1101 in base -2. 
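
The repeated-division procedure can be sketched in C as follows. The function names are hypothetical. The digits are packed into an unsigned 64-bit pattern (bit i holds digit $a_i$) so that every 32-bit int has a representation, and from_negabinary simply evaluates the defining sum; it is intended for patterns whose value fits in a long long.

```c
/* Converts n to base -2 by repeated modulus division by -2.
   Bit i of the result is the digit with weight (-2)^i. */
unsigned long long to_negabinary(int n) {
    unsigned long long result = 0;
    int pos = 0;

    while (n != 0) {
        int r = n % 2;               /* C remainder: -1, 0, or 1. */
        if (r < 0) r = -r;           /* Force remainder to be 0 or 1. */
        n = (n - r) / -2;            /* Exact division; quotient for next step. */
        result |= (unsigned long long)r << pos;
        pos++;
    }
    return result;
}

/* Evaluates a base -2 bit pattern as an ordinary integer. */
long long from_negabinary(unsigned long long bits) {
    long long value = 0, weight = 1;

    while (bits != 0) {
        if (bits & 1) value += weight;
        bits >>= 1;
        if (bits != 0) weight *= -2; /* Next weight, only if still needed. */
    }
    return value;
}
```

For instance, to_negabinary(-3) yields the pattern 1101 (0xD), matching the plan worked out above.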


Table 12-1 shows, on the left, how each bit pattern from
0000 to 1111 is interpreted in base -2, and on the right, how
integers in the range -15 to +15 are represented.


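TABLE 12-1. CONVERSIONS BETWEEN DECIMAL AND BASE -2


n            n           |  n            n            -n
(base -2)    (decimal)   |  (decimal)    (base -2)    (base -2)

0            0           |  0            0            0
1            1           |  1            1            11
10           -2          |  2            110          10
11           -1          |  3            111          1101
100          4           |  4            100          1100
101          5           |  5            101          1111
110          2           |  6            11010        1110
111          3           |  7            11011        1001
1000         -8          |  8            11000        1000
1001         -7          |  9            11001        1011
1010         -10         |  10           11110        1010
1011         -9          |  11           11111        110101
1100         -4          |  12           11100        110100
1101         -3          |  13           11101        110111
1110         -6          |  14           10010        110110
1111         -5          |  15           10011        110001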


It is not obvious that the $2^n$ possible bit patterns in an n-bit
word uniquely represent all integers in a certain range, but this
can be shown by induction. The inductive hypothesis is that an
n-bit word represents all integers in the range

$$-(2^{n+1} - 2)/3 \text{ to } (2^n - 1)/3 \quad \text{for } n \text{ even, and} \tag{1a}$$

$$-(2^n - 2)/3 \text{ to } (2^{n+1} - 1)/3 \quad \text{for } n \text{ odd.} \tag{1b}$$

Assume first that n is even. For n = 2, the representable
integers are 10, 11, 00, and 01 in base -2, or

-2, -1, 0, 1.

This agrees with (1a) and each integer in the range is
represented once and only once.

A word of n + 1 bits can, with a leading bit of 0, represent
all the integers given by (1a). In addition, with a leading bit of
1, it can represent all these integers biased by $(-2)^n = 2^n$. The
new range is

$$2^n - (2^{n+1} - 2)/3 \text{ to } 2^n + (2^n - 1)/3,$$

or

$$(2^n - 1)/3 + 1 \text{ to } (2^{n+2} - 1)/3.$$

This is contiguous to the range given by (1a), so for a word size
of n + 1 bits, all integers in the range

$$-(2^{n+1} - 2)/3 \text{ to } (2^{n+2} - 1)/3$$

are represented once and only once. This agrees with (1b), with
n replaced by n + 1.

The proof that (1a) follows from (1b), for n odd, and that all
integers in the range are uniquely represented, is similar.


To add and subtract, the usual rules, such as 0 + 1 = 1 and
1 - 1 = 0, of course apply. Because 2 is written 110, and -1 is
written 11, and so on, the following additional rules apply.
These, together with the obvious ones, suffice.

   1 + 1     = 110
   11 + 1    = 0
   1 + 1 + 1 = 111
   0 - 1     = 11
   11 - 1    = 10

When adding or subtracting, there are sometimes two carry
bits. The carry bits are to be added to their column, even when
subtracting. It is convenient to place them both over the next bit
to the left and simplify (when possible) using 11 + 1 = 0. If 11
is carried to a column that contains two 0’s, bring down a 1 and
carry a 1. Below are examples.


   Addition

       11  1 11    11
           1  0  1  1  1         19
   +    1  1  0  1  0  1      + (-11)
   ---------------------      -------
        0  1  1  0  0  0          8

   Subtraction

        1 11  1     1
              1  0  1  0  1         21
   -       1  0  1  1  1  0      - (-38)
   ------------------------      -------
        1  0  0  1  1  1  1         59

The only carries possible are 0, 1, and 11. Overflow occurs if 
there is a carry (either 1 or 11) out of the high-order position. 
These remarks apply to both addition and subtraction. 
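
A digit-serial model of these rules is sketched below for an 8-bit word. The function name nb_addsub8 and its interface are hypothetical. The carry is held as an ordinary integer, with -1 playing the role of the two-bit carry 11 (which is -1 in base -2), so the three possible carries are 0, 1, and -1.

```c
/* Adds (subtract == 0) or subtracts (subtract != 0) two base -2
   numbers held as 8-bit patterns. *overflow is set to 1 if a carry
   remains after the high-order position, else to 0. */
unsigned nb_addsub8(unsigned a, unsigned b, int subtract, int *overflow) {
    unsigned result = 0;
    int carry = 0;                    /* 0, 1, or -1 (written 11 in base -2). */
    int i;

    for (i = 0; i < 8; i++) {
        int ai = (int)((a >> i) & 1);
        int bi = (int)((b >> i) & 1);
        int t = ai + (subtract ? -bi : bi) + carry;   /* -2 to 3. */
        int digit = (t + 4) % 2;      /* Result digit, 0 or 1. */
        result |= (unsigned)digit << i;
        carry = (digit - t) / 2;      /* Next column has weight -2 times this one. */
    }
    *overflow = (carry != 0);
    return result;
}
```

With the operands of the two examples above, the routine reproduces the sums 011000 and 1001111.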


Because there are three possibilities for the carry, a base -2
adder would be more complex than a two's-complement adder.

There are two ways to negate an integer. It can be added to
itself shifted left one position (that is, multiply by -1), or it can
be subtracted from 0. There is no rule as simple and convenient
as the “complement and add 1” rule of two's-complement
arithmetic. In two's-complement, this rule is used to build a
subtracter from an adder (to compute A - B, form A + ¬B + 1).

For base -2, there is no device quite that simple, but a
method that is nearly as simple is to complement the minuend
(meaning to invert each bit), add the complemented minuend to
the subtrahend, and then complement the sum [Lang]. Here is
an example showing the subtraction of 13 from 6 using this
scheme on an eight-bit machine.


Click here to view code image 


00011010 6 

00011101 13 

11100101 6 complemented 

11110110 (6 complemented) + 13 
00001001 Complement of the sum (-7) 


This method uses the identity


   A - B = I - ((I - A) + B)


in base -2 arithmetic, with I a word of all 1's.


Multiplication of base -2 integers is straightforward. Just use
the rule that 1 x 1 = 1 and 0 times either 0 or 1 is 0, and add
the columns using base -2 addition.


Division, however, is quite complicated. It is a real challenge 
to devise a reasonable hardware division algorithm—that is, one 
based on repeated subtraction and shifting. Figure 12-1 shows 
an algorithm that is expressed, for definiteness, for an 8-bit 
machine. It does modulus division (nonnegative remainder). 


Although this program is written in C and was tested on a
binary two's-complement machine, that is immaterial; it should
be viewed somewhat abstractly. The input quantities n and d,
and all internal variables except for q, are simply numbers
without any particular representation. The output q is a string of
bits to be interpreted in base -2.


This requires a little explanation. If the input quantities were
in base -2, the algorithm would be very awkward to express in
an executable form. For example, the test “if (d > 0)” would
have to test that the most significant bit of d is in an even
position. The addition in “c = c + d” would have to be a base -2
addition. The code would be very hard to read. The way the
algorithm is coded, you should think of n and d as numbers
without any particular representation. The code shows the
arithmetic operations to be performed, whatever encoding is
used. If the numbers are encoded in base -2, as they would be in
hardware that implements this algorithm, the multiplication by
-128 is a left shift of seven positions, and the divisions by -2 are
right shifts of one position.


As examples, the code computes values as follows:

   divbm2(6, 2) = 7     (six divided by two is 111 in base -2)
   divbm2(-4, 3) = 2    (minus four divided by three is 10 in base -2)
   divbm2(-4, -3) = 6   (minus four divided by minus three is 110 in
                        base -2)




int divbm2(int n, int d) {      // q = n/d in base -2.
   int r, dw, c, q, i;

   r = n;                       // Init. remainder.
   dw = (-128)*d;               // Position d.
   c = (-43)*d;                 // Init. comparand.
   if (d > 0) c = c + d;
   q = 0;                       // Init. quotient.
   for (i = 7; i >= 0; i--) {
      if (d > 0 ^ (i&1) == 0 ^ r >= c) {
         q = q | (1 << i);      // Set a quotient bit.
         r = r - dw;            // Subtract d shifted.
      }
      dw = dw/(-2);             // Position d.
      if (d > 0) c = c - 2*d;   // Set comparand for
      else c = c + d;           // next iteration.
      c = c/(-2);
   }
   return q;                    // Return quotient in
                                // base -2.
                                // Remainder is r,
}                               // 0 <= r < |d|.


FIGURE 12-1. Division in base -2. 


The step q = q | (1 << i); represents simply setting bit i of q.
The next line, r = r - dw, represents reducing the remainder
by the divisor d shifted left.
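
The example values quoted earlier can be checked mechanically. The following is a minimal sketch, not part of the original figure: it repeats the routine of Figure 12-1 with explicit parentheses in the test, and adds a helper, here called from_bm2 (a name chosen only for illustration), that reads the low eight bits of a word as a base -2 number.

```c
/* The division routine of Figure 12-1 (8-bit quotient, modulus division). */
int divbm2(int n, int d) {
    int r = n;                  /* remainder */
    int dw = (-128) * d;        /* divisor positioned at bit 7 */
    int c = (-43) * d;          /* comparand */
    int q = 0;                  /* quotient bits, read in base -2 */
    if (d > 0) c = c + d;
    for (int i = 7; i >= 0; i--) {
        if ((d > 0) ^ ((i & 1) == 0) ^ (r >= c)) {
            q |= 1 << i;        /* set a quotient bit */
            r -= dw;            /* subtract d shifted */
        }
        dw = dw / (-2);
        if (d > 0) c = c - 2 * d;
        else       c = c + d;
        c = c / (-2);
    }
    return q;
}

/* Illustrative helper: value of the low 8 bits of x read in base -2. */
int from_bm2(unsigned x) {
    int value = 0;
    int weight = 1;             /* (-2)^i for the current bit */
    for (int i = 0; i < 8; i++) {
        if (x & (1u << i)) value += weight;
        weight *= -2;
    }
    return value;
}
```

With these definitions, from_bm2(divbm2(6, 2)) is 3, and from_bm2(divbm2(-4, 3)) is -2, which is the modulus-division quotient because -4 = 3(-2) + 2.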


The algorithm is difficult to describe in detail, but we will try 
to give the general idea. 


Consider determining the value of the first bit of the quotient,
bit 7 of q. In base -2, 8-bit numbers that have their most
significant bit “on” range in value from -170 to -43.
Therefore, ignoring the possibility of overflow, the first (most
significant) quotient bit will be 1 if (and only if) the quotient
will be algebraically less than or equal to -43.


Because n = qd + r and, for a positive divisor, r ≤ d - 1, for
a positive divisor the first quotient bit will be 1 iff n ≤ -43d +
(d - 1), or n < -43d + d. For a negative divisor, the first
quotient bit will be 1 iff n ≥ -43d (r ≥ 0 for modulus
division).

Thus, the first quotient bit is 1 iff


   (d > 0 & ¬(n ≥ -43d + d)) | (d < 0 & n ≥ -43d).


Ignoring the possibility that d = 0, this can be written as


   d > 0 ⊕ n ≥ c,


where c = -43d + d if d ≥ 0, and c = -43d if d < 0.


This is the logic for determining a quotient bit for an odd-
numbered bit position. For an even-numbered position, the logic
is reversed. Hence, the test includes the term (i&1) == 0. (The ^
character in the program denotes exclusive or.)


At each iteration, c is set equal to the smallest (closest to
zero) integer that must have a 1-bit at position i after dividing
by d. If the current remainder r exceeds that, then bit i of q is
set to 1 and r is adjusted by subtracting the value of a 1 at that
position, multiplied by the divisor d. No real multiplication is
required here; d is simply positioned properly and subtracted.


The algorithm is not elegant. It is awkward to implement 
because there are several additions, subtractions, and 
comparisons, and there is even a multiplication (by a constant) 
that must be done at the beginning. One might hope for a 
“uniform” algorithm—one that does not test the signs of the 
arguments and do different things depending on the outcome. 
Such a uniform algorithm, however, probably does not exist for 
base -2 (or for two’s-complement arithmetic). The reason for 
this is that division is inherently a non-uniform process. 
Consider the simplest algorithm of the shift-and-subtract type. 
This algorithm would not shift at all, but for positive arguments 
would simply subtract the divisor from the dividend repeatedly, 
counting the number of subtractions performed until the 
remainder is less than the divisor. On the other hand, if the 
dividend is negative (and the divisor is positive), the process is 
to add the divisor repeatedly until the remainder is 0 or positive, 
and the quotient is the negative of the count obtained. The 
process is still different if the divisor is negative. 


In spite of this, division is a uniform process for the signed- 
magnitude representation of numbers. With such a 
representation, the magnitudes are positive, so the algorithm can 
simply subtract magnitudes and count until the remainder is 
negative, and then set the sign bit of the quotient to the exclusive 
or of the arguments, and the sign bit of the remainder equal to 
the sign of the dividend (this gives ordinary truncating division). 


The algorithm given above could be made more uniform, in a
sense, by first complementing the divisor, if it is negative, and
then performing the steps given as simplified by having d > 0.
Then a correction would be performed at the end. For modulus
division, the correction is to negate the quotient and leave the
remainder unchanged. This moves some of the tests out of the
loop, but the algorithm as a whole is still not pretty.


It is interesting to contrast the commonly used number 
representations and base -2 regarding the question of whether 
or not the computer hardware treats numbers uniformly in 
carrying out the four fundamental arithmetic operations. We 
don't have a precise definition of “uniformly,” but basically it
means free of operations that might or might not be done, 
depending on the signs of the arguments. We consider setting 
the sign bit of the result equal to the exclusive or of the signs of 
the arguments to be a uniform operation. Table 12-2 shows 
which operations treat their operands uniformly with various 
number representations. 


One’s-complement addition and subtraction are done 
uniformly by means of the “end around carry” trick. For
addition, all bits, including the sign bit, are added in the usual 
binary way, and the carry out of the leftmost bit (the sign bit) is 
added to the least significant position. This process always 
terminates right away (that is, the addition of the carry cannot 
generate another carry out of the sign bit position). 
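
The end-around carry can be sketched in C as follows, assuming an 8-bit one's-complement word held in an unsigned byte. The function name oc_add8 is illustrative only. The second addition cannot carry again, because the largest 9-bit sum, 0x1FE, folds to 0xFE + 1 = 0xFF.

```c
#include <stdint.h>

/* 8-bit one's-complement addition with end-around carry. */
uint8_t oc_add8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + b;      /* up to 9 significant bits */
    sum = (sum & 0xFFu) + (sum >> 8);    /* add the carry-out into bit 0 */
    return (uint8_t)sum;
}
```

For example, -1 is 0xFE and 2 is 0x02 in this encoding, and oc_add8(0xFE, 0x02) gives 0x01.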


TABLE 12-2. UNIFORM OPERATIONS IN VARIOUS NUMBER
ENCODINGS


                  Signed-      One's-        Two's-
                  magnitude    complement    complement    Base -2

addition          no           yes           yes           yes
subtraction       no           yes           yes           yes
multiplication    yes          no            no            yes
division          yes          no            no            no

In the case of two's-complement multiplication, the entry is 
“yes” if only the right half of the doubleword product is desired. 


We conclude this discussion of the base -2 number system 
with some observations about how to convert between straight 
binary and base -2. 


To convert to binary from base -2, form a word that has only
the bits with positive weight, and subtract a word that has only
the bits with negative weight, using the subtraction rules of
binary arithmetic. An alternative method that may be a little
simpler is to extract the bits appearing in the negative weight
positions, shift them one position to the left, and subtract the
extracted number from the original number using the
subtraction rules of ordinary binary arithmetic.


To convert to base -2 from binary, extract the bits appearing
in the odd positions (positions weighted by 2^n with n odd), shift
them one position to the left, and add the two numbers using the
addition rules of base -2. Here are two examples:


      Binary from base -2                Base -2 from binary

         110111  (-13)                     110111  (55)
     -  1000100  (binary subtract)      + 1000100  (base -2 add)
   ------------                         ---------
   ...111110011  (-13)                    1001011  (55)


On a computer, with its fixed word size, these conversions
work for negative numbers if the carries out of the high-order
position are simply discarded. To illustrate, the example on the
right above can be regarded as converting -9 to base -2 from
binary if the word size is six bits.


The above algorithm for converting to base -2 cannot easily
be implemented in software on a binary computer, because it
requires doing addition in base -2. Schroeppel [HAK, item 128]
overcomes this with a much more clever and useful way to do
the conversions in both directions. To convert to binary, his
method is


   B ← (N ⊕ 0b10...1010) - 0b10...1010.


To see why this works, let the base -2 number consist of the
four digits abcd. Then, interpreted (erroneously) in straight
binary, this is 8a + 4b + 2c + d. After the exclusive or,
interpreted in binary it is 8(1 - a) + 4b + 2(1 - c) + d. After
the (binary) subtraction of 8 + 2, it is -8a + 4b - 2c + d,
which is its value interpreted in base -2.


Schroeppel's formula can be readily solved for N in terms of
B, so it gives a three-instruction method for converting in the
other direction. Collecting these results, we have the following
formulas for converting to binary for a 32-bit machine:


   B ← (N & 0x55555555) - (N & ¬0x55555555),
   B ← N - ((N & 0xAAAAAAAA) << 1),
   B ← (N ⊕ 0xAAAAAAAA) - 0xAAAAAAAA,


and the following, for converting to base -2 from binary:

   N ← (B + 0xAAAAAAAA) ⊕ 0xAAAAAAAA.
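
Schroeppel's two formulas translate directly into C. The sketch below assumes a 32-bit word; the function names and the ODD_BITS macro are illustrative, and the final cast to int32_t relies on the usual two's-complement wraparound, which is implementation-defined in older C standards. The helper bm2_add is an added convenience, not something from the text: it adds two base -2 numbers by converting to binary, adding, and converting back, discarding carries out of the word.

```c
#include <stdint.h>

#define ODD_BITS 0xAAAAAAAAu   /* 1-bits in the negative-weight positions */

/* Base -2 bit pattern -> two's-complement value. */
int32_t bm2_to_binary(uint32_t n) {
    return (int32_t)((n ^ ODD_BITS) - ODD_BITS);
}

/* Two's-complement value -> base -2 bit pattern. */
uint32_t binary_to_bm2(int32_t b) {
    return ((uint32_t)b + ODD_BITS) ^ ODD_BITS;
}

/* Base -2 addition done by way of ordinary binary addition. */
uint32_t bm2_add(uint32_t x, uint32_t y) {
    uint32_t sum = ((x ^ ODD_BITS) - ODD_BITS) + ((y ^ ODD_BITS) - ODD_BITS);
    return (sum + ODD_BITS) ^ ODD_BITS;
}
```

These reproduce the earlier examples: bm2_to_binary(0x37) is -13, binary_to_bm2(55) is 0x4B (1001011), and bm2_add confirms the rules 1 + 1 = 110 and 11 + 1 = 0.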


12-2 Base -1 + i


By using -1 + i as the base, where i is √-1, all complex integers
(complex numbers with integral real and imaginary parts) can
be expressed as a single “number” without an explicit sign or
other irregularity. Surprisingly, this can be done using only 0
and 1 for digits, and all integers are represented uniquely. We
will not prove this or much else about this number system, but
will just describe it very briefly.


It is not entirely trivial to discover how to write the integer 
2.1 But it can be determined algorithmically by successively 
dividing 2 by the base and recording the remainders. What does 
a “remainder” mean in this context? We want the remainder 
after dividing by -1 + i to be 0 or 1, if possible (so that the
digits will be 0 or 1). To see that it is always possible, assume
that we are to divide an arbitrary complex integer a + bi by
-1 + i. Then, we wish to find q and r such that q is a complex
integer, r = 0 or 1, and


   a + bi = (q_r + q_i i)(-1 + i) + r,


where q_r and q_i denote the real and imaginary parts of q,
respectively. Equating real and imaginary parts and solving the
two simultaneous equations for q gives


   q_r = (b - a + r)/2,  and  q_i = (-a - b + r)/2.

Clearly, if a and b are both even or are both odd, then by
choosing r = 0, q is a complex integer. Furthermore, if one of a
and b is even and the other is odd, then by choosing r = 1, q is a
complex integer.


Thus, the integer 2 can be converted to base -1 + i by the
plan illustrated below.


Because the real and imaginary parts of the integer 2 are both
even, we simply do the division, knowing that the remainder
will be 0:


   2/(-1 + i) = 2(-1 - i)/((-1 + i)(-1 - i)) = -1 - i  rem 0.


Because the real and imaginary parts of -1 - i are both odd,
again we simply divide, knowing that the remainder is 0:


   (-1 - i)/(-1 + i) = (-1 - i)(-1 - i)/((-1 + i)(-1 - i)) = i  rem 0.


Because the real and imaginary parts of i are even and odd,
respectively, the remainder will be 1. It is simplest to account
for this at the beginning by subtracting 1 from the dividend.


   (i - 1)/(-1 + i) = 1  (remainder is 1).


Because the real and imaginary parts of 1 are odd and even,
the next remainder will be 1. Subtracting this from the dividend
gives


   (1 - 1)/(-1 + i) = 0  (remainder is 1).

Because we have reached a 0 quotient, the process
terminates, and the base -1 + i representation for 2 is seen to
be 1100 (reading the remainders upward).
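
The repeated-division plan can be sketched in C as below. This is an illustration under stated assumptions rather than a definitive routine: the name to_base_m1pi is made up, the digits are packed into a 64-bit word with the first remainder in bit 0, and inputs are assumed small enough that 64 digits suffice. The quotient formulas are q_r = (b - a + r)/2 and q_i = (-a - b + r)/2, with r chosen so that both numerators are even.

```c
#include <stdint.h>

/* Digits of a + bi in base -1 + i, least significant digit in bit 0. */
uint64_t to_base_m1pi(long a, long b) {
    uint64_t digits = 0;
    int pos = 0;
    while ((a != 0 || b != 0) && pos < 64) {
        long r = ((a + b) % 2 != 0);   /* 1 when a and b differ in parity */
        long qr = (b - a + r) / 2;     /* exact: the numerator is even */
        long qi = (-a - b + r) / 2;    /* exact: the numerator is even */
        digits |= (uint64_t)r << pos;  /* record the remainder */
        pos++;
        a = qr;                        /* continue with the quotient */
        b = qi;
    }
    return digits;
}
```

For the integer 2 this returns 0xC, that is, 1100, in agreement with the hand calculation above.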

Table 12-3 shows how each bit pattern from 0000 to 1111 is 
interpreted in base - 1 + i and how the real integers in the 
range -15 to +15 are represented. 

The addition rules for base – 1 + i (in addition to the trivial 
ones involving a 0-bit) are as follows: 


1+1 = 1100 

1+1+1 = 1101 
1+1+1+1 = 111010000 
1+1+1+1+1 = 111010001 
1+1+1+1+1+1 = 111011100 
1+1+1+1+1+1+1 = 111011101 
1+1+1+1+1+1+1+1 = 111000000 


TABLE 12-3. CONVERSIONS BETWEEN DECIMAL AND BASE -1 + i


      n             n            n            n                -n
(base -1 + i)   (decimal)    (decimal)   (base -1 + i)    (base -1 + i)

       0            0            0                0                0
       1            1            1                1            11101
      10         -1 + i          2             1100            11100
      11            i            3             1101            10001
     100          -2i            4        111010000            10000
     101         1 - 2i          5        111010001         11001101
     110         -1 - i          6        111011100         11001100
     111           -i            7        111011101         11000001
    1000         2 + 2i          8        111000000         11000000
    1001         3 + 2i          9        111000001         11011101
    1010         1 + 3i         10        111001100         11011100
    1011         2 + 3i         11        111001101         11010001
    1100            2           12        100010000         11010000
    1101            3           13        100010001    1110100001101
    1110         1 + i          14        100011100    1110100001100
    1111         2 + i          15        100011101    1110100000001


When adding two numbers, the largest number of carries that 
occurs in one column is six, so the largest sum of a column is 8 
(111000000). This makes for a rather complicated adder. If one 
were to build a complex arithmetic machine, it would no doubt 
be best to keep the real and imaginary parts separate,2 with each 
represented in some sensible way such as two's-complement. 


12-3 Other Bases 


The base -1 - i has essentially the same properties as the base
-1 + i discussed above. If a certain bit pattern represents the
number a + bi in one of these bases, then the same bit pattern
represents the number a - bi in the other base.


The bases 1 + i and 1 - i can also represent all the complex
integers, using only 0 and 1 for digits. These two bases have the
same complex-conjugate relationship to each other, as do the
bases -1 ± i. In bases 1 ± i, the representation of some integers
has an infinite string of 1's on the left, similar to the two's-
complement representation of negative integers. This arises
naturally by using uniform rules for addition and subtraction, as
in the case of two's-complement. One such integer is 2, which
(in either base) is written ...11101100. Thus, these bases have
the rather complex addition rule 1 + 1 = ...11101100.


By grouping into pairs the bits in the base -2 representation
of an integer, one obtains a base 4 representation for the positive
and negative numbers, using the digits -2, -1, 0, and 1. For
example,


   -14 (decimal) = 11 01 10 (base -2) = -1·4^2 + 1·4^1 - 2·4^0.


Similarly, by grouping into pairs the bits in the base -1 + i
representation of a complex integer, we obtain a base -2i
representation for the complex integers using the digits 0, 1,
-1 + i, and i. This is a bit too complicated to be interesting.


The “quater-imaginary” system [Knu2] is similar. It
represents the complex integers using 2i as a base, and the digits
0, 1, 2, and 3 (with no sign). To represent some integers, namely 
those with an odd imaginary component, it is necessary to use a 
digit to the right of the radix point. For example, i is written 
10.2 in base 2i. 


12-4 What Is the Most Efficient Base? 


Suppose you are building a computer and you are trying to 
decide what base to use to represent integers. For the registers 
you have available circuits that are 2-state (binary), 3-state, 4- 
state, and so on. Which should you use? 


Let us assume that the cost of a b-state circuit is proportional
to b. Thus, a 3-state circuit costs 50% more than a binary circuit,
a 4-state circuit costs twice as much as a binary circuit, and so
on.


Suppose you want the registers to be able to hold integers
from 0 to some maximum M. Encoding integers from 0 to M in
base b requires ⌈log_b(M + 1)⌉ digits (e.g., to represent all
integers from 0 to 999,999 in decimal requires log_10(1,000,000)
= 6 digits).

One would expect the cost of a register to be equal to the 
product of the number of digits required times the cost to 
represent each digit: 


   c = k log_b(M + 1) · b,


where c is the cost of a register and k is a constant of 
proportionality. For a given M, we wish to find b that minimizes 
the cost. 

The minimum of this function occurs for that value of b that
makes dc/db = 0. Thus, we have


   dc/db = d/db (k b log_b(M + 1)) = d/db (k b ln(M + 1)/ln b)
         = k ln(M + 1) · (ln b - 1)/(ln b)^2.


This is zero when ln b = 1, or b = e.


This is not a very satisfactory result. Because e ≈ 2.718, 2
and 3 must be the most efficient integral bases. Which is more
efficient? The ratio of the cost of a base 2 register to the cost of a
base 3 register is


   c(2)/c(3) = (k · 2 log_2(M + 1))/(k · 3 log_3(M + 1))
             = (2 ln(M + 1)/ln 2)/(3 ln(M + 1)/ln 3)
             = (2 ln 3)/(3 ln 2) ≈ 1.056.


Thus, base 2 is more costly than base 3, but only by a small 
amount. 


By the same analysis, base 2 is more costly than base e by a 
factor of about 1.062. 
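
For a concrete register, the cost model can be evaluated with integer arithmetic alone. The sketch below takes the constant of proportionality k to be 1 and counts digits by repeated division, which gives ⌈log_b(M + 1)⌉ exactly. The function names are illustrative. Because the digit count is rounded up to a whole number, the result for a particular M only approximates the continuous ratios derived above.

```c
/* Digits needed to hold 0..M in base b, i.e. ceil(log_b(M + 1)). */
unsigned digits_needed(unsigned long long M, unsigned b) {
    unsigned n = 0;
    while (M > 0) {
        M /= b;     /* strip one base-b digit */
        n++;
    }
    return n;
}

/* Register cost with k = 1: number of digits times states per digit. */
unsigned register_cost(unsigned long long M, unsigned b) {
    return digits_needed(M, b) * b;
}
```

For M = 999,999 this gives a cost of 40 in base 2 (20 digits), 39 in base 3 (13 digits), 40 in base 4 (10 digits), and 60 in base 10 (6 digits).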
Exercises 


1. Schroeppel's formula for converting from base -2 to
binary has a dual involving the constant 0x55555555. Can
you find it?


2. Show how to add 1 to a base -2 number using the
arithmetic and logical operations of a binary computer.
For example, 0b111 ⇒ 0b100.


3. Show how to round a base -2 number down (in the
negative direction) to a multiple of 16 using the
arithmetic and logical operations of a binary computer.
For example, 0b10 ⇒ 0b110000.


4. Write a program, in a language of your choice, to convert
a base -1 + i integer to the form a + bi, where a and b
are real integers. For example, if you give the program
the integer 33, or 0x21, it should display something like
5 - 4i.


5. How would you convert a number in base -1 + i to its
negative? Extract its real part? Extract its imaginary part?
Convert it to its complex conjugate? (The complex
conjugate of a + bi is a - bi.)


Chapter 13. Gray Code 


13-1 Gray Code 


Is it possible to cycle through all 2^n combinations of n bits by
changing only one bit at a time? The answer is “yes,” and this is
the defining property of Gray codes. That is, a Gray code is an
encoding of the integers such that a Gray-coded integer and its 
successor differ in only one bit position. This concept can be 
generalized to apply to any base, such as decimal, but here we 
will discuss only binary Gray codes. 

Although there are many binary Gray codes, we will discuss 
only one: the "reflected binary Gray code." This code is what is 
usually meant in the literature by the unqualified term “Gray 
code." We will show, usually without proof, how to do some 
basic operations in this representation of integers, and we will 
show a few surprising properties. 

The reflected binary Gray code is constructed as follows. Start 
with the strings 0 and 1, representing the integers 0 and 1: 


0 
1 

Reflect this about a horizontal axis at the bottom of the list,
and place a 1 to the left of the new list entries and a 0 to the left
of the original list entries:


00 
01 
11 
10 

This is the reflected binary Gray code for n = 2. To get the
code for n = 3, reflect this and attach a 0 or 1 as before:


000 
001 
011 
010 
110 
111 
101 
100 

From this construction, it is easy to see by induction on n that
(1) each of the 2^n bit combinations appears once and only once
in the list, (2) only one bit changes in going from one list entry 
to the next, and (3) only one bit changes when cycling around 
from the last entry to the first. Gray codes having this last 
property are called “cyclic,” and the reflected binary Gray code 
is necessarily cyclic. 
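
The reflection construction is easy to mimic in C. The following sketch assumes the caller supplies an array of at least 2^n entries; the function name is illustrative. At each step it copies the existing list in reverse order and sets the new leftmost bit in the copies, leaving that bit 0 in the original entries.

```c
/* Fill code[0 .. 2^n - 1] with the n-bit reflected binary Gray code. */
void build_reflected_gray(unsigned n, unsigned *code) {
    unsigned len = 1;           /* current list length */
    code[0] = 0;
    for (unsigned bit = 0; bit < n; bit++) {
        /* Reflect about the bottom of the list and prefix a 1. */
        for (unsigned k = 0; k < len; k++)
            code[len + k] = code[len - 1 - k] | (1u << bit);
        len *= 2;
    }
}
```

For n = 3 the array receives 0, 1, 3, 2, 6, 7, 5, 4, which is the list 000, 001, 011, 010, 110, 111, 101, 100 shown above.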


If n > 2, there are non-cyclic codes that take on all 2^n values
once and only once. One such code is 000 001 011 010 110 100 
101 111. 


Figure 13-1 shows, for n = 4, the integers encoded in 
ordinary binary and in Gray code. The formulas show how to 
convert from one representation to the other at the bit-by-bit 
level (as it would be done in hardware). 


   Binary    Gray
    abcd     efgh

    0000     0000
    0001     0001
    0010     0011
    0011     0010
    0100     0110
    0101     0111
    0110     0101
    0111     0100
    1000     1100
    1001     1101
    1010     1111
    1011     1110
    1100     1010
    1101     1011
    1110     1001
    1111     1000


   Gray from Binary      Binary from Gray

      e = a                 a = e
      f = a ⊕ b             b = e ⊕ f
      g = b ⊕ c             c = e ⊕ f ⊕ g
      h = c ⊕ d             d = e ⊕ f ⊕ g ⊕ h


FIGURE 13-1. 4-bit Gray code and conversion formulas. 


As for the number of Gray codes on n bits, notice that one
still has a cyclic binary Gray code after rotating the list (starting
at any of the 2^n positions and cycling around) or reordering the
columns. Any combination of these operations results in a
distinct code. Therefore, there are at least 2^n · n! cyclic binary
Gray codes on n bits. There are more than this for n > 3.


The Gray code and binary representations have the following
dual relationships, evident from the formulas given in Figure
13-1:


* Bit i of a Gray-coded integer is the parity of bit i and the 
bit to the left of i in the corresponding binary integer 
(using 0 if there is no bit to the left of i). 


* Bit i of a binary integer is the parity of all the bits at and 
to the left of position i in the corresponding Gray-coded 
integer. 


Converting to Gray from binary can be done in only two
instructions:


   G ← B ⊕ (B >> 1),


where the shift is unsigned (logical).


The conversion to binary from Gray is harder. One method is
given by


   B ← (G >> 0) ⊕ (G >> 1) ⊕ ... ⊕ (G >> (n - 1)),


with unsigned shifts. We have already seen this formula in
“Computing the Parity of a Word” on page 96. As mentioned
there, this formula can be evaluated as illustrated below for
n = 32.


   B = G ^ (G >> 1);
   B = B ^ (B >> 2);
   B = B ^ (B >> 4);
   B = B ^ (B >> 8);
   B = B ^ (B >> 16);


Thus, in general it requires 2⌈log2 n⌉ instructions.
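
Packaged as C functions for 32-bit words, the two conversions look like this. It is a minimal sketch; the names are illustrative, and unsigned types are used so that the right shifts are logical.

```c
#include <stdint.h>

/* Reflected binary Gray code of b. */
uint32_t gray_from_binary(uint32_t b) {
    return b ^ (b >> 1);
}

/* Inverse conversion: XOR of g shifted right by 0, 1, ..., 31. */
uint32_t binary_from_gray(uint32_t g) {
    uint32_t b = g ^ (g >> 1);
    b ^= b >> 2;
    b ^= b >> 4;
    b ^= b >> 8;
    b ^= b >> 16;
    return b;
}
```

For instance, gray_from_binary(8) is 12 (binary 1000 becomes Gray 1100, as in Figure 13-1), and binary_from_gray(12) returns 8.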


Because it is so easy to convert from binary to Gray, it is 
trivial to generate successive Gray-coded integers: 



   for (i = 0; i < n; i++) {
      G = i ^ (i >> 1);
      output G;
   }


13-2 Incrementing a Gray-Coded Integer 


The logic for incrementing a 4-bit binary integer abcd can be
expressed as follows, using Boolean algebra notation:


   d' = ¬d
   c' = c ⊕ d
   b' = b ⊕ cd
   a' = a ⊕ bcd


Thus, one way to build a Gray-coded counter in hardware is to
build a binary counter using the above logic and convert the
outputs a', b', c', d' to Gray by forming the exclusive or of
adjacent bits, as shown under “Gray from Binary” in Figure 13-1.


A way that might be slightly better is described by the
following formulas:


   p  = e ⊕ f ⊕ g ⊕ h
   h' = h ⊕ ¬p
   g' = g ⊕ hp
   f' = f ⊕ g(¬h)p
   e' = e ⊕ f(¬g)(¬h)p


That is, the general case is


   G'_n = G_n ⊕ (G_{n-1} ¬G_{n-2} ... ¬G_0 p),  n ≥ 2.


Because the parity p alternates between 0 and 1, a counter 
circuit might maintain p in a separate 1-bit register and simply 
invert it on each count. 


In software, the best way to find the successor G' of a Gray- 
coded integer G is probably simply to convert G to binary, 
increment the binary word, and convert it back to Gray code. 
Another way that's interesting and almost as good is to 
determine which bit to flip in G. The pattern goes like this, 
expressed as a word to be exclusive or'd to G: 


   1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 16


The alert reader will recognize this as a mask that identifies 
the position of the leftmost bit that changes when incrementing 
the integer 0, 1, 2, 3, ..., corresponding to the positions in the 
above list. Thus, to increment a Gray-coded integer G, the bit 
position to invert is given by the leftmost bit that changes when 
1 is added to the binary integer corresponding to G. 


This leads to the algorithms for incrementing a Gray-coded
integer G as shown in Figure 13-2. They both first convert G to
binary, which is shown as index(G).


FIGURE 13-2. Incrementing a Gray-coded integer. 
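
Two such algorithms can be sketched in C as follows, assuming 32-bit words. The function names are illustrative, and index_of plays the role of index(G). The first converts to binary, increments, and converts back. The second forms a mask holding the rightmost 0-bit of the index, which is the leftmost bit that changes when 1 is added, and flips that bit of G. The two differ only when the index is all 1's: the first then wraps to 0, while the second leaves G unchanged.

```c
#include <stdint.h>

/* index(G): the binary integer whose reflected Gray code is g. */
static uint32_t index_of(uint32_t g) {
    uint32_t b = g ^ (g >> 1);
    b ^= b >> 2;
    b ^= b >> 4;
    b ^= b >> 8;
    b ^= b >> 16;
    return b;
}

/* Successor by converting to binary, adding 1, and converting back. */
uint32_t gray_inc_via_binary(uint32_t g) {
    uint32_t b = index_of(g) + 1;
    return b ^ (b >> 1);
}

/* Successor by flipping the single bit that must change. */
uint32_t gray_inc_via_mask(uint32_t g) {
    uint32_t b = index_of(g);
    uint32_t m = ~b & (b + 1);   /* isolates the rightmost 0-bit of b */
    return g ^ m;
}
```

Both functions map the Gray code 0100 (index 7) to 1100 (index 8).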


A pencil-and-paper method of incrementing a Gray-coded 
integer is as follows: 


Starting from the right, find the first place at which the 
parity of bits at and to the left of the position is even. 
Invert the bit at this position. 


Or, equivalently: 


   Let p be the parity of the word G. If p is even, invert the
   rightmost bit. If p is odd, invert the bit to the left of the
   rightmost 1-bit.


The latter rule is directly expressed in the Boolean equations 
given above. 


13-3 Negabinary Gray Code 


If you write the integers in order in base -2 and convert them
using the “shift and exclusive or” that converts to Gray from
straight binary, you get a Gray code. The 3-bit Gray code has
indexes that range over the 3-bit base -2 numbers, namely -2 to
5. Similarly, the 4-bit Gray code corresponding to 4-bit base -2
numbers has indexes ranging from -10 to 5. It is not a reflected
Gray code, but it almost is. The 4-bit negabinary Gray code can
be generated by starting with 0 and 1, reflecting this about a
horizontal axis at the top of the list, and then reflecting it about
a horizontal axis at the bottom of the list, and so on. It is cyclic.


To convert back to base -2 from this Gray code, the rules are, 
of course, the same as they are for converting to straight binary 
from ordinary reflected binary Gray code (because these 
operations are inverses, no matter what the interpretation of the 
bit strings is). 
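
As a small illustration, the code word for an integer v can be computed in C by first converting v to base -2 and then applying the shift and exclusive or. The sketch assumes a 32-bit word and uses the conversion N = (B + 0xAAAAAAAA) ⊕ 0xAAAAAAAA from Chapter 12; the function name is illustrative.

```c
#include <stdint.h>

/* Negabinary Gray code word for the integer v. */
uint32_t negabinary_gray(int32_t v) {
    /* Convert v to its base -2 bit pattern. */
    uint32_t n = ((uint32_t)v + 0xAAAAAAAAu) ^ 0xAAAAAAAAu;
    /* Same shift and exclusive or as for ordinary binary. */
    return n ^ (n >> 1);
}
```

For v = -2, -1, 0, ..., 5 this yields 011, 010, 000, 001, 101, 100, 110, 111. Successive words differ in one bit, as do the last and first.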


13-4 Brief History and Applications 


Gray codes are named after Frank Gray, a physicist at Bell
Telephone Laboratories, who in the 1930s invented the method
we now use for broadcasting color TV in a way that's compatible
with the black-and-white transmission and reception methods
then in existence; that is, when the color signal is received by a
black-and-white set, the picture appears in shades of gray.


Martin Gardner [Gard] discusses applications of Gray codes 
involving the Chinese ring puzzle, the Tower of Hanoi puzzle, 
and Hamiltonian paths through graphs that represent 
hypercubes. He also shows how to convert from the decimal 
representation of an integer to a decimal Gray code 
representation. 


Gray codes are used in position sensors. A strip of material is 
made with conducting and nonconducting areas, corresponding 
to the 1’s and 0’s of a Gray-coded integer. Each column has a 
conducting wire brush positioned to read it out. If a brush is 
positioned on the dividing line between two of the quantized 
positions so that its reading is ambiguous, then it doesn't matter 
which way the ambiguity is resolved. There can be only one
ambiguous brush, and interpreting it as a 0 or 1 gives a position
adjacent to the dividing line.


The strip can instead be a series of concentric circular tracks, 
giving a rotational position sensor. For this application, the Gray 
code must be cyclic. Such a sensor is shown in Figure 13-3, 
where the four dots represent the brushes. 


It is possible to construct cyclic Gray codes for rotational 
sensors that require only one ring of conducting and 
nonconducting areas, although at some expense in resolution for 
a given number of brushes. The brushes are spaced around the 
ring rather than on a radial line. These codes are called single 
track Gray codes, or STGCs. 


The idea is to find a code for which, when written out as in
Figure 13-1, every column is a rotation of the first column (and
that is cyclic, assuming the code is for a rotational device). The
reflected Gray code for n = 2 is trivially an STGC. STGCs for
n = 2 through 4 are shown here.


   n = 2    n = 3    n = 4

    00       000      0000
    01       001      0001
    11       011      0011
    10       111      0111
             110      1111
             100      1110
                      1100
                      1000


STGCs allow the construction of more compact rotational
position sensors. A rotational STGC device for n = 3 is shown in
Figure 13-4.

These are all very similar, simple, and rather uninteresting
patterns. Following these patterns, an STGC for the case n = 5
would have ten code words, giving a resolution of 36 degrees. It
is possible to do much better. Figure 13-5 shows an STGC for
n = 5 with 30 code words, giving a resolution of 12 degrees. It is
close to the optimum of 32 code words.


FIGURE 13-3. Rotational position sensor. 


FIGURE 13-4. Single track rotational position sensor. 


10000 01000 00100 00010 00001 

10100 01010 00101 10010 01001 

11100 01110 00111 10011 11001 

11110 01111 10111 11011 11101 

11010 01101 10110 01011 10101 

11000 01100 00110 00011 10001 
FIGURE 13-5. An STGC for n = 5. 
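
The two defining properties can be checked mechanically for the code in Figure 13-5. The sketch below assumes the figure is read down each column in turn, left to right, which is an assumption about the layout rather than something the figure states. The array and function names are illustrative.

```c
#include <stdbool.h>

/* Figure 13-5, read column by column (assumed reading order). */
static const unsigned stgc5[30] = {
    0x10, 0x14, 0x1C, 0x1E, 0x1A, 0x18,   /* 10000 10100 11100 11110 11010 11000 */
    0x08, 0x0A, 0x0E, 0x0F, 0x0D, 0x0C,   /* 01000 01010 01110 01111 01101 01100 */
    0x04, 0x05, 0x07, 0x17, 0x16, 0x06,   /* 00100 00101 00111 10111 10110 00110 */
    0x02, 0x12, 0x13, 0x1B, 0x0B, 0x03,   /* 00010 10010 10011 11011 01011 00011 */
    0x01, 0x09, 0x19, 0x1D, 0x15, 0x11    /* 00001 01001 11001 11101 10101 10001 */
};

/* True if cyclically consecutive words differ in exactly one bit. */
bool is_cyclic_gray(const unsigned *code, int len) {
    for (int t = 0; t < len; t++) {
        unsigned diff = code[t] ^ code[(t + 1) % len];
        if (diff == 0 || (diff & (diff - 1)) != 0) return false;
    }
    return true;
}

/* True if every bit track is a cyclic shift of track 0. */
bool is_single_track(const unsigned *code, int len, int nbits) {
    for (int j = 1; j < nbits; j++) {
        bool found = false;
        for (int s = 0; s < len && !found; s++) {
            bool match = true;
            for (int t = 0; t < len && match; t++)
                match = ((code[t] >> j) & 1u) == (code[(t + s) % len] & 1u);
            found = match;
        }
        if (!found) return false;
    }
    return true;
}
```

Under that reading order, both is_cyclic_gray(stgc5, 30) and is_single_track(stgc5, 30, 5) hold; each group of six words is the previous group rotated right by one position.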


All the STGCs in this section above are the best possible, in
the sense that for n = 2 through 5, the largest number of code
words possible is 4, 6, 8, and 30.


An STGC has been constructed with exactly 360 code words,
with n = 9 (the smallest possible value of n, because any code
for n = 8 has at most 256 code words) [HilPat].


Exercises 


1. Show that if an integer x is even, then G(x) (the reflected 
binary Gray code of x) has an even number of 1-bits, and 
if x is odd, G(x) has an odd number of 1-bits. 


2. A balanced Gray code is a cyclic Gray code in which the 
number of bit changes is the same in all columns, as one 
cycles around the code. 


(a) Show that an STGC is necessarily balanced. 


(b) Can you find a balanced Gray code for n = 3 that has
eight code words?


3. Devise a cyclic Gray code that encodes the integers from 0
to 9.


4. [Knu6] Given a number in prime decomposed form, show 
how to list all its divisors in such a way that each divisor 
in the list is derived from the previous divisor by a single 
multiplication or division by a prime. 


Chapter 14. Cyclic Redundancy Check 


14-1 Introduction 


The cyclic redundancy check, or CRC, is a technique for 
detecting errors in digital data, but not for making corrections 
when errors are detected. It is used primarily in data 
transmission. In the CRC method, a certain number of check bits, 
often called a checksum, or a hash code, are appended to the 
message being transmitted. The receiver can determine whether
or not the check bits agree with the data to ascertain, with a
certain degree of probability, whether an error occurred in
transmission. If an error occurred, the receiver sends a “negative
acknowledgment" (NAK) back to the sender, requesting that the 
message be retransmitted. 


The technique is also sometimes applied to data storage 
devices, such as a disk drive. In this situation each block on the 
disk would have check bits, and the hardware might 
automatically initiate a reread of the block when an error is 
detected, or it might report the error to software. 


The material that follows speaks in terms of a “sender” and a 
“receiver” of a “message,” but it should be understood that it
applies to storage writing and reading as well. 


Section 14-2 describes the theory behind the CRC 
methodology. Section 14-3 shows how the theory is put into 
practice in hardware, and gives a software implementation of a 
popular method known as CRC-32. 


Background 


There are several techniques for generating check bits that can 
be added to a message. Perhaps the simplest is to append a 
single bit, called the “parity bit,” which makes the total number 
of 1-bits in the code vector (message with parity bit appended) 
even (or odd). If a single bit gets altered in transmission, this 
will change the parity from even to odd (or the reverse). The 
sender generates the parity bit by simply summing the message 
bits modulo 2—that is, by exclusive or'ing them together. It then 
appends the parity bit (or its complement) to the message. The
receiver can check the message by summing all the message bits
modulo 2 and checking that the sum agrees with the parity bit.
Equivalently, the receiver can sum all the bits (message and
parity) and check that the result is 0 (if even parity is being
used).
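
As a minimal sketch of the parity scheme just described, the sender's parity bit can be formed by exclusive or'ing all the message bits together. The function name even_parity_bit and the byte-array message layout are illustrative assumptions, not part of the text.

```c
#include <stddef.h>

/* Returns the even-parity bit for a message of n bytes: the
   exclusive or (sum modulo 2) of all the message bits. */
unsigned even_parity_bit(const unsigned char *message, size_t n) {
   unsigned x = 0;
   size_t i;

   for (i = 0; i < n; i++)
      x = x ^ message[i];        // Fold all the bytes together.
   x = x ^ (x >> 4);             // Fold the eight bits down to one.
   x = x ^ (x >> 2);
   x = x ^ (x >> 1);
   return x & 1;
}
```

A receiver using even parity can apply the same function to the message and exclusive or the result with the received parity bit; a result of 0 means no odd-bit error was seen.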


This simple parity technique is often said to detect 1-bit 
errors. Actually, it detects errors in any odd number of bits 
(including the parity bit), but it is a small comfort to know you 
are detecting 3-bit errors if you are missing 2-bit errors. 


For bit serial sending and receiving, the hardware required to 
generate and check a single parity bit is very simple. It consists 
of a single exclusive or gate together with some control circuitry. 
For bit parallel transmission, an exclusive or tree may be used, as 
illustrated in Figure 14-1. Efficient ways to compute the parity 
bit in software are given in Section 5-2 on page 96. 


FIGURE 14-1. Exclusive or tree. (Output: even parity bit.)


Other techniques for computing a checksum are to form the 
exclusive or of all the bytes in the message, or to compute a sum 
with end-around carry of all the bytes. In the latter method, the 
carry from each 8-bit sum is added into the least significant bit 
of the accumulator. It is believed that this is more likely to 
detect errors than the simple exclusive or, or the sum of the bytes 
with carry discarded. 
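
The two byte-oriented checksums mentioned above might be coded as follows. This is only a sketch: the function names and the explicit length parameter are assumptions made for illustration.

```c
#include <stddef.h>

/* Exclusive or of all the bytes in the message. */
unsigned char xor_checksum(const unsigned char *message, size_t n) {
   unsigned char c = 0;
   size_t i;

   for (i = 0; i < n; i++)
      c = c ^ message[i];
   return c;
}

/* 8-bit sum with end-around carry: the carry out of each 8-bit
   addition is added back into the least significant bit. */
unsigned char end_around_checksum(const unsigned char *message, size_t n) {
   unsigned sum = 0;
   size_t i;

   for (i = 0; i < n; i++) {
      sum = sum + message[i];
      sum = (sum & 0xFF) + (sum >> 8);   // Wrap the carry around.
   }
   return (unsigned char)sum;
}
```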


A technique that is believed to be quite good in terms of error
detection, and which is easy to implement in hardware, is the
cyclic redundancy check. This is another way to compute a 
checksum, usually eight, 16, or 32 bits in length, that is 
appended to the message. We will briefly review the theory, 
show how the theory is implemented in hardware, and then give 
software for a commonly used 32-bit CRC checksum. 


We should mention that there are much more sophisticated 
ways to compute a checksum, or hash code, for data. Examples 
are the hash functions known as MD5 and SHA-1, whose hash 
codes are 128 and 160 bits in length, respectively. These 
methods are used mainly in cryptographic applications and are 
substantially more difficult to implement, in hardware and 
software, than the CRC methodology described here. However, 
SHA-1 is used in certain revision control systems (Git and 
others) as simply a check on data integrity. 


14-2 Theory 


The CRC is based on polynomial arithmetic, in particular, on 
computing the remainder when dividing one polynomial in 
GF(2) (Galois field with two elements) by another. It is a little 
like treating the message as a very large binary number, and 
computing the remainder when dividing it by a fairly large 
prime such as $2^{32} - 5$. Intuitively, one would expect this to give
a reliable checksum. 


A polynomial in GF(2) is a polynomial in a single variable x 
whose coefficients are 0 or 1. Addition and subtraction are done 
modulo 2—that is, they are both the same as the exclusive or 
operation. For example, the sum of the polynomials 


$$x^3 + x + 1 \quad\text{and}\quad x^4 + x^3 + x^2 + x$$

is $x^4 + x^2 + 1$, as is their difference. These polynomials are not
usually written with minus signs, but they could be, because a 
coefficient of -1 is equivalent to a coefficient of 1. 


Multiplication of such polynomials is straightforward. The 
product of one coefficient by another is the same as their 
combination by the logical and operator, and the partial 
products are summed using exclusive or. Multiplication is not 
needed to compute the CRC checksum. 
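
To make the arithmetic concrete, here is a small sketch that holds a polynomial over GF(2) in an unsigned integer, with bit i as the coefficient of $x^i$. The representation and the names gf2_add and gf2_mul are assumptions for illustration; the product must fit in the word, so the degrees of the two operands must sum to less than the word size.

```c
/* Addition (and subtraction) of polynomials over GF(2). */
unsigned gf2_add(unsigned a, unsigned b) {
   return a ^ b;                 // Coefficients are added modulo 2.
}

/* Multiplication: shift-and-add, with exclusive or as the "add." */
unsigned gf2_mul(unsigned a, unsigned b) {
   unsigned product = 0;

   while (b != 0) {
      if (b & 1)
         product = product ^ a;  // Sum in this partial product.
      a = a << 1;                // Multiply a by x.
      b = b >> 1;
   }
   return product;
}
```

With this encoding, $x^3 + x + 1$ is binary 1011 and $x^4 + x^3 + x^2 + x$ is binary 11110; their sum is binary 10101.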


Division of polynomials over GF(2) can be done in much the 
same way as long division of polynomials over the integers. Here 
is an example. 


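$$
\begin{array}{rl}
 & x^4 + x^3 + 1 \\
x^3 + x + 1 \;\big) & x^7 + x^6 + x^5 + x^2 + x \\
 & x^7 + x^5 + x^4 \\
 & \quad x^6 + x^4 \\
 & \quad x^6 + x^4 + x^3 \\
 & \qquad\qquad x^3 + x^2 + x \\
 & \qquad\qquad x^3 + x + 1 \\
 & \qquad\qquad\qquad x^2 + 1
\end{array}
$$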


The reader may verify that the quotient $x^4 + x^3 + 1$ multiplied
by the divisor $x^3 + x + 1$, plus the remainder $x^2 + 1$, equals
the dividend.
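
The long division can also be sketched in code. The routine below is an illustration only; it assumes 32-bit unsigned integers, bit i as the coefficient of $x^i$, and a nonzero divisor, and the name gf2_divide is hypothetical.

```c
/* Long division of polynomials over GF(2). Returns the quotient
   and stores the remainder in *rem. The divisor must be nonzero. */
unsigned gf2_divide(unsigned dividend, unsigned divisor, unsigned *rem) {
   unsigned quotient = 0;
   int ddeg = 31, i;

   while (((divisor >> ddeg) & 1) == 0)
      ddeg--;                                   // Degree of the divisor.
   for (i = 31 - ddeg; i >= 0; i--) {
      if ((dividend >> (i + ddeg)) & 1) {       // Leading term present?
         dividend = dividend ^ (divisor << i);  // Subtract x^i * divisor.
         quotient = quotient | (1u << i);
      }
   }
   *rem = dividend;
   return quotient;
}
```

For the example above, the dividend is binary 11100110 and the divisor is binary 1011.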


The CRC method treats the message as a polynomial in GF(2).
For example, the message 11001001, where the order of
transmission is from left to right (110...), is treated as a
representation of the polynomial $x^7 + x^6 + x^3 + 1$. The sender
and receiver agree on a certain fixed polynomial called the
generator polynomial. For example, for a 16-bit CRC the CCITT
(Le Comité Consultatif International Télégraphique et
Téléphonique)1 has chosen the polynomial $x^{16} + x^{12} + x^5 + 1$,
which is now widely used for a 16-bit CRC checksum. To
compute an r-bit CRC checksum, the generator polynomial must
be of degree r. The sender appends r 0-bits to the m-bit message
and divides the resulting polynomial of degree m + r - 1 by the
generator polynomial. This produces a remainder polynomial of
degree r - 1 (or less). The remainder polynomial has r
coefficients, which are the checksum. The quotient polynomial is
discarded. The data transmitted (the code vector) is the original
m-bit message followed by the r-bit checksum.


There are two ways for the receiver to assess the correctness 
of the transmission. It can compute the checksum from the first 
m bits of the received data and verify that it agrees with the last 
r received bits. Alternatively, and following usual practice, the 
receiver can divide all the m + r received bits by the generator 
polynomial and check that the r-bit remainder is 0. To see that 
the remainder must be 0, let M be the polynomial representation
of the message, and let R be the polynomial representation of
the remainder that was computed by the sender. Then the
transmitted data corresponds to the polynomial $Mx^r - R$ (or,
equivalently, $Mx^r + R$). By the way R was computed, we know
that $Mx^r = QG + R$, where G is the generator polynomial and Q
is the quotient (that was discarded). Therefore the transmitted
data, $Mx^r - R$, is equal to QG, which is clearly a multiple of G. If
the receiver is built as nearly as possible just like the sender, the
receiver will append r 0-bits to the received data as it computes
the remainder R. The received data with 0-bits appended is still
a multiple of G, so the computed remainder is still 0.
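
The basic sender and receiver computations can be sketched as follows, before any of the refinements discussed next. This is an illustration under stated assumptions: the message is at most 32 bits, the generator g is given with its high-order bit included (r + 1 bits, r at most 31), and the names gf2_mod, crc_encode, and crc_check are hypothetical.

```c
/* Remainder of polynomial a divided by generator g over GF(2).
   Bit i holds the coefficient of x^i; r is the degree of g. */
unsigned gf2_mod(unsigned long long a, unsigned g, int r) {
   int i;

   for (i = 63; i >= r; i--)
      if ((a >> i) & 1)
         a = a ^ ((unsigned long long)g << (i - r));
   return (unsigned)a;
}

/* Sender: append r 0-bits to the message, divide by g, and return
   the code vector (the message followed by the r-bit checksum). */
unsigned long long crc_encode(unsigned message, unsigned g, int r) {
   unsigned long long shifted = (unsigned long long)message << r;

   return shifted | gf2_mod(shifted, g, r);   // M x^r + R.
}

/* Receiver: accept the code vector if it is a multiple of g. */
int crc_check(unsigned long long received, unsigned g, int r) {
   return gf2_mod(received, g, r) == 0;
}
```

For instance, with the CCITT generator (hex 11021 with the high-order bit included) and the message 11001001, crc_check accepts the value returned by crc_encode and rejects it after any single bit is complemented.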


That's the basic idea, but in reality the process is altered 
slightly to correct for certain deficiencies. For example, the 
method as described is insensitive to the number of leading and 
trailing 0-bits in the data transmitted. In particular, if a failure
occurred that caused the received data, including the checksum, 
to be all-0, it would be accepted. 


Choosing a “good” generator polynomial is something of an
art and beyond the scope of this text. Two simple observations:
For an r-bit checksum, G should be of degree r, because
otherwise the first bit of the checksum would always be 0, which
wastes a bit of the checksum. Similarly, the last coefficient
should be 1 (that is, G should not be divisible by x), because
otherwise the last bit of the checksum would always be 0
(because $Mx^r = QG + R$, if G is divisible by x, then R must be
also). The following facts about generator polynomials are
proved in [PeBr] and/or [Tanen]:


• If G contains two or more terms, all single-bit errors are
detected.


• If G is not divisible by x (that is, if the last term is 1), and
e is the least positive integer such that G evenly divides
$x^e + 1$, then all double errors that are within a frame of e
bits are detected. A particularly good polynomial in this
respect is $x^{15} + x^{14} + 1$, for which e = 32767.


• If x + 1 is a factor of G, all errors consisting of an odd
number of bits are detected.


• An r-bit CRC checksum detects all burst errors of length
≤ r. (A burst error of length r is a string of r bits in
which the first and last are in error, and the intermediate
r - 2 bits may or may not be in error.)


The generator polynomial x + 1 creates a checksum of length
1, which applies even parity to the message. (Proof hint: For
arbitrary k ≥ 0, what is the remainder when dividing $x^k$ by x +
1?)
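
The claim about x + 1 is easy to check by brute force. The sketch below assumes an 8-bit message, and the name crc1 is hypothetical; it computes the 1-bit checksum by straightforward division by x + 1 (binary 11).

```c
/* Remainder of the 8-bit message polynomial, with one 0-bit
   appended, divided by x + 1 (binary 11). */
unsigned crc1(unsigned char message) {
   unsigned a = (unsigned)message << 1;   // Append r = 1 zero bit.
   int i;

   for (i = 8; i >= 1; i--)
      if ((a >> i) & 1)
         a = a ^ (3u << (i - 1));         // Subtract x^(i-1) * (x + 1).
   return a;                              // 0 or 1.
}
```

For every 8-bit message the result equals the exclusive or of the message bits, which is the even-parity bit.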

It is interesting to note that if a code of any type can detect 
all double-bit and single-bit errors, then it can in principle 
correct single-bit errors. To see this, suppose data containing a 
single-bit error is received. Imagine complementing all the bits, 
one at a time. In all cases but one, this results in a double-bit 
error, which is detected. But when the erroneous bit is 
complemented, the data is error free, which is recognized. In 
spite of this, the CRC method does not seem to be used for 
single-bit error correction. Instead, the sender is requested to 
repeat the whole transmission if any error is detected. 
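
The correction-by-trial idea can be sketched with a very small code. The example assumes the generator $x^3 + x + 1$, a 4-bit message, and therefore a 7-bit code vector; for this generator e = 7, so all double-bit errors within the 7-bit word are detected. The names rem7, encode7, and correct7 are hypothetical.

```c
/* Remainder of the 7-bit word a divided by x^3 + x + 1 (binary 1011). */
static unsigned rem7(unsigned a) {
   int i;

   for (i = 6; i >= 3; i--)
      if ((a >> i) & 1)
         a = a ^ (0xBu << (i - 3));
   return a;
}

/* Sender: four message bits followed by a 3-bit checksum. */
unsigned encode7(unsigned message) {
   unsigned shifted = (message & 0xF) << 3;

   return shifted | rem7(shifted);
}

/* Receiver: complement the bits one at a time until the word
   checks. Assumes at most one bit is in error. */
unsigned correct7(unsigned received) {
   int i;

   if (rem7(received) == 0)
      return received;                       // Already error free.
   for (i = 0; i < 7; i++)
      if (rem7(received ^ (1u << i)) == 0)
         return received ^ (1u << i);        // Found the bad bit.
   return received;                          // Not reached for one error.
}
```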


14-3 Practice 


Table 14-1 shows the generator polynomials used by some 
common CRC standards. The “Hex” column shows the 
hexadecimal representation of the generator polynomial; the 
most significant bit is omitted, as it is always 1. 


The CRC standards differ in ways other than the choice of 
generating polynomials. Most initialize by assuming that the 
message has been preceded by certain nonzero bits, others do no 
such initialization. Most transmit the bits within a byte least 
significant bit first, some most significant bit first. Most append 
the checksum least significant byte first, others most significant 
byte first. Some complement the checksum. 


CRC-12 is used for transmission of 6-bit character streams, 
and the others are for 8-bit characters, or 8-bit bytes of arbitrary 
data. CRC-16 is used in IBM's BISYNCH communication 
standard. The CRC-CCITT polynomial, also known as ITU-TSS, is 
used in communication protocols such as XMODEM, X.25, IBM's 
SDLC, and ISO's HDLC [Tanen]. CRC-32 is also known as 
AUTODIN-II and ITU-TSS (ITU-TSS has defined both 16-bit and
32-bit polynomials). It is used in PKZip, Ethernet, AAL5 (ATM
Adaptation Layer 5), FDDI (Fiber Distributed Data Interface), the
IEEE-802 LAN/MAN standard, and in some DOD applications. It
is the one for which software algorithms are given here. 


The first three polynomials in Table 14-1 have x + 1 as a
factor. The last (CRC-32) does not.


TABLE 14-1. GENERATOR POLYNOMIALS OF SOME CRC CODES 


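Common Name | r  | Generator Polynomial | Hex
CRC-12      | 12 | $x^{12} + x^{11} + x^3 + x^2 + x + 1$ | 80F
CRC-16      | 16 | $x^{16} + x^{15} + x^2 + 1$ | 8005
CRC-CCITT   | 16 | $x^{16} + x^{12} + x^5 + 1$ | 1021
CRC-32      | 32 | $x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1$ | 04C11DB7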


To detect the error of erroneous insertion or deletion of 
leading 0’s, some protocols prepend one or more nonzero bits to 
the message. These don't actually get transmitted; they are 
simply used to initialize the key register (described below) used 
in the CRC calculation. A value of r 1-bits seems to be 
universally used. The receiver initializes its register in the same 
way. 


The problem of trailing 0’s is a little more difficult. There
would be no problem if the receiver operated by comparing the
remainder based on just the message bits to the checksum
received. But it seems to be simpler for the receiver to calculate
the remainder for all bits received (message and checksum), plus
r appended 0-bits. The remainder should be 0. With a 0
remainder, if the message has trailing 0-bits inserted or deleted,
the remainder will still be 0, so this error goes undetected.


The usual solution to this problem is for the sender to 
complement the checksum before appending it. Because this 
makes the remainder calculated by the receiver nonzero 
(usually), the remainder will change if trailing 0’s are inserted or 
deleted. How then does the receiver recognize an error-free 
transmission? 


Using the “mod” notation for remainder, we know that

$$(Mx^r + R) \bmod G = 0.$$

Denoting the “complement” of the polynomial R by $\bar{R}$, we have

$$
\begin{aligned}
(Mx^r + \bar{R}) \bmod G &= (Mx^r + (x^{r-1} + x^{r-2} + \cdots + 1 + R)) \bmod G \\
&= ((Mx^r + R) + x^{r-1} + x^{r-2} + \cdots + 1) \bmod G \\
&= (x^{r-1} + x^{r-2} + \cdots + 1) \bmod G.
\end{aligned}
$$

Thus, the checksum calculated by the receiver for an error-free
transmission should be

$$(x^{r-1} + x^{r-2} + \cdots + 1) \bmod G.$$


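This is a constant (for a given G). For CRC-32 this polynomial,
called the residual or residue, is

$$x^{31} + x^{30} + x^{26} + x^{25} + x^{24} + x^{18} + x^{15} + x^{14} + x^{12} + x^{11} + x^{10} + x^8 + x^6 + x^5 + x^4 + x^3 + x + 1,$$

or hex C704DD7B [Black]. (With the r 0-bits that the receiver
appends taken into account, this is
$(x^{r-1} + x^{r-2} + \cdots + 1)\,x^r \bmod G$ for r = 32.)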
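
The hex value quoted for the CRC-32 residue can be reproduced with a few lines of code. The sketch below assumes 32-bit unsigned integers and computes $(x^{31} + x^{30} + \cdots + 1)\,x^{32} \bmod G$, that is, a register of all 1's followed by 32 appended 0-bits; the function name crc32_residue is hypothetical.

```c
/* Start with all 1's in the register and feed in 32 0-bits,
   reducing by the CRC-32 generator at each step. */
unsigned int crc32_residue(void) {
   unsigned int reg = 0xFFFFFFFF;
   int i;

   for (i = 0; i < 32; i++) {
      if (reg & 0x80000000)
           reg = (reg << 1) ^ 0x04C11DB7;
      else reg = reg << 1;
   }
   return reg;
}
```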


Hardware 


To develop a hardware circuit for computing the CRC checksum, 
we reduce the polynomial division process to its essentials. 

The process employs a shift register, which we denote by 
CRC. This is of length r (the degree of G) bits, not r + 1 as you
might expect. When the subtractions (exclusive or's) are done, it 
is not necessary to represent the high-order bit, because the 
high-order bits of G and the quantity it is being subtracted from 
are both 1. The division process might be described informally 
as follows: 

   Initialize the CRC register to all 0-bits.
   Get first/next message bit m.
   If the high-order bit of CRC is 1,
      Shift CRC and m together left 1 position, and XOR the
      result with the low-order r bits of G.
   Otherwise,
      Just shift CRC and m left 1 position.
   If there are more message bits, go back to get the next one.


It might seem that the subtraction should be done first, and 
then the shift. It would be done that way if the CRC register held 
the entire generator polynomial, which in bit form is r + 1 bits.
Instead, the CRC register holds only the low-order r bits of G, so
the shift is done first, to align things properly. 


The contents of the CRC register for the generator $G = x^3 + x + 1$
and the message $M = x^7 + x^6 + x^5 + x^2 + x$ are shown
below. Expressed in binary, G = 1011 and M = 11100110.


000 Initial CRC contents. High-order bit is 0, so just shift in first message bit, giving:

001 High-order bit is 0, so just shift in second message bit, giving:

011 High-order bit is 0 again, so just shift in third message bit, giving:

111 High-order bit is 1, so shift in fourth message bit and then XOR with 011, giving:

101 High-order bit is 1, so shift in fifth message bit and then XOR with 011, giving:

001 High-order bit is 0, so just shift in sixth message bit, giving:

011 High-order bit is 0, so just shift in seventh message bit, giving:

111 High-order bit is 1, so shift in eighth message bit and then XOR with 011, giving:

101 There are no more message bits, so this is the remainder.
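
The informal division process traced above can be sketched in code. This is an illustration only: the message is passed as an integer holding nbits bits with the first-transmitted bit most significant, g_low holds the low-order r bits of G (1 ≤ r ≤ 32), and the name lfsr_divide is hypothetical.

```c
/* Bit-serial polynomial division. Returns M mod G. */
unsigned lfsr_divide(unsigned message, int nbits, unsigned g_low, int r) {
   unsigned crc = 0;                        // Initialize to all 0-bits.
   unsigned high = 1u << (r - 1);           // High-order bit of CRC.
   unsigned mask = (high << 1) - 1;         // Keeps the low-order r bits.
   int i;

   for (i = nbits - 1; i >= 0; i--) {
      unsigned m = (message >> i) & 1;      // Get next message bit.
      if (crc & high)
           crc = (((crc << 1) | m) ^ g_low) & mask;
      else crc = ((crc << 1) | m) & mask;
   }
   return crc;
}
```

Called with the message 11100110, nbits = 8, g_low = 011, and r = 3, it steps through the register contents listed above and returns binary 101.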


These steps can be implemented with the (simplified) circuit
shown in Figure 14-2, which is known as a feedback shift register.
The three boxes in the figure represent the three bits of the CRC
register. When a message bit comes in, if the high-order bit ($x^2$
box) is 0, simultaneously the message bit is shifted into the $x^0$
box, the bit in $x^0$ is shifted to $x^1$, the bit in $x^1$ is shifted to $x^2$,
and the bit in $x^2$ is discarded. If the high-order bit of the CRC
register is 1, then a 1 is present at the lower input of each of the
two exclusive or gates. When a message bit comes in, the same
shifting takes place, but the three bits that wind up in the CRC
register have been exclusive or'ed with binary 011. When all the
message bits have been processed, the CRC holds M mod G.


FIGURE 14-2. Polynomial division circuit for $G = x^3 + x + 1$.
(Register boxes $x^2$, $x^1$, $x^0$; the message input enters at the right.)


If the circuit of Figure 14-2 were used for the CRC 
calculation, then after processing the message, r (in this case 3) 
0-bits would have to be fed in. Then the CRC register would
have the desired checksum, $Mx^r$ mod G. There is a way to avoid
this step with a simple rearrangement of the circuit. 


Instead of feeding the message in at the right end, feed it in 
at the left end, r steps away, as shown in Figure 14-3. This has
the effect of premultiplying the input message M by $x^r$. But
premultiplying and postmultiplying are the same for
polynomials. Therefore, as each message bit comes in, the CRC 
register contents are the remainder for the portion of the 
message processed, as if that portion had r 0-bits appended. 
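
A software sketch of the rearranged circuit follows. It makes the same assumptions as the division routine one might write for Figure 14-2 (message bits in an integer, first-transmitted bit most significant, g_low the low-order r bits of G, 1 ≤ r ≤ 32); the name lfsr_crc is hypothetical. The only change is that the message bit is combined with the bit leaving the left end of the register, rather than being shifted in at the right.

```c
/* Bit-serial CRC with the message fed in at the left end.
   Returns M x^r mod G without feeding in r extra 0-bits. */
unsigned lfsr_crc(unsigned message, int nbits, unsigned g_low, int r) {
   unsigned crc = 0;
   unsigned mask = ((1u << (r - 1)) << 1) - 1;   // Low-order r bits.
   int i;

   for (i = nbits - 1; i >= 0; i--) {
      unsigned m = (message >> i) & 1;           // Next message bit.
      unsigned feedback = ((crc >> (r - 1)) & 1) ^ m;
      crc = (crc << 1) & mask;
      if (feedback)
         crc = crc ^ g_low;
   }
   return crc;
}
```

For $G = x^3 + x + 1$ and M = 11100110 this returns binary 100, which is what the division circuit would hold after three additional 0-bits.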


FIGURE 14-3. CRC circuit for $G = x^3 + x + 1$. (The message input
enters at the left end.)


Figure 14-4 shows the circuit for the CRC-32 polynomial. 


FIGURE 14-4. CRC circuit for CRC-32.


Software 


Figure 14-5 shows a basic implementation of CRC-32 in 
software. The CRC-32 protocol initializes the CRC register to all
1's, transmits each byte least significant bit first, and
complements the checksum. We assume the message consists of 
an integral number of bytes. 


To follow Figure 14-4 as closely as possible, the program 
uses left shifts. This requires reversing each message byte and
positioning it at the left end of the 32-bit register, denoted byte
in the program. The word-level reversing program shown in 
Figure 7-1 on page 129 can be used (although this is not very 
efficient, because we need to reverse only eight bits). 


The code of Figure 14-5 is shown for illustration only. It can 
be improved substantially while still retaining its one-bit-at-a- 
time character. First, notice that the eight bits of the reversed 
byte are used in the inner loop's if-statement and then 
discarded. Also, the high-order eight bits of crc are not altered 
in the inner loop (other than by shifting). Therefore, we can set 
crc = crc ^ byte ahead of the inner loop, simplify the if- 
statement, and omit the left shift of byte at the bottom of the 
loop. 
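
The first improvement just described might look like the following sketch, which still shifts left and still reverses bits. It assumes 32-bit unsigned integers and a NUL-terminated message; the names reverse32 and crc32_left are hypothetical, and reverse32 is one common way to reverse a word, not necessarily the routine the text refers to.

```c
/* Reverse the bits of a 32-bit word. */
static unsigned int reverse32(unsigned int x) {
   x = ((x & 0x55555555) << 1) | ((x >> 1) & 0x55555555);
   x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333);
   x = ((x & 0x0F0F0F0F) << 4) | ((x >> 4) & 0x0F0F0F0F);
   x = (x << 24) | ((x & 0xFF00) << 8) | ((x >> 8) & 0xFF00) | (x >> 24);
   return x;
}

unsigned int crc32_left(const unsigned char *message) {
   unsigned int byte, crc;
   int i, j;

   crc = 0xFFFFFFFF;
   for (i = 0; message[i] != 0; i++) {
      byte = reverse32(message[i]);     // Reversed byte, at the left end.
      crc = crc ^ byte;                 // Fold in all eight bits at once.
      for (j = 0; j <= 7; j++) {        // Do eight times.
         if (crc & 0x80000000)
              crc = (crc << 1) ^ 0x04C11DB7;
         else crc = crc << 1;
      }
   }
   return reverse32(~crc);
}
```

For the nine-character string "123456789" this returns hex CBF43926, the commonly quoted check value for CRC-32.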


The two reversals can be avoided by shifting right instead of 
left. This requires reversing the hex constant that represents the 
CRC-32 polynomial and testing the least significant bit of crc. 
Finally, the if-test can be replaced with some simple logic, to 
save branches. The result is shown in Figure 14-6. 


unsigned int crc32(unsigned char *message) {
   int i, j;
   unsigned int byte, crc;

   i = 0;
   crc = 0xFFFFFFFF;
   while (message[i] != 0) {
      byte = message[i];            // Get next byte.
      byte = reverse(byte);         // 32-bit reversal.
      for (j = 0; j <= 7; j++) {    // Do eight times.
         if ((int)(crc ^ byte) < 0)
              crc = (crc << 1) ^ 0x04C11DB7;
         else crc = crc << 1;
         byte = byte << 1;          // Ready next msg bit.
      }
      i = i + 1;
   }
   return reverse(~crc);
}


FIGURE 14-5. Basic CRC-32 algorithm. 


It is not unreasonable to unroll the inner loop by the full 
factor of eight. If this is done, the program of Figure 14-6 
executes in about 46 instructions per byte of input message. This 
includes a load and a branch. (We rely on the compiler to 
common the two loads of message[i] and to transform the 
while-loop so there is only one branch, at the bottom of the 
loop.) 




unsigned int crc32(unsigned char *message) {
   int i, j;
   unsigned int byte, crc, mask;

   i = 0;
   crc = 0xFFFFFFFF;
   while (message[i] != 0) {
      byte = message[i];            // Get next byte.
      crc = crc ^ byte;
      for (j = 7; j >= 0; j--) {    // Do eight times.
         mask = -(crc & 1);
         crc = (crc >> 1) ^ (0xEDB88320 & mask);
      }
      i = i + 1;
   }
   return ~crc;
}


FIGURE 14-6. Improved bit-at-a-time CRC-32 algorithm. 


Our next version employs table lookup. This is the usual way 
that CRC-32 is calculated. Although the programs above work 
one bit at a time, the table lookup method (as usually 
implemented) works one byte at a time. A table of 256 fullword 
constants is used. 


The inner loop of Figure 14-6 shifts register crc right eight 
times, while doing an exclusive or operation with a constant 
when the low-order bit of crc is 1. These steps can be replaced 
by a single right shift of eight positions, followed by a single 
exclusive or with a mask that depends on the pattern of 1-bits in 
the rightmost eight bits of the crc register. 


It turns out that the calculations for setting up the table are 
the same as those for computing the CRC of a single byte. The 
code is shown in Figure 14-7. To keep the program self-contained,
it includes steps to set up the table on first use. In
practice, these steps would probably be put in a separate 
function to keep the CRC calculation as simple as possible. 
Alternatively, the table could be defined by a long sequence of 
array initialization data. When compiled with GCC to the basic 
RISC, the function executes 13 instructions per byte of input. 
This includes two loads and one branch instruction. 


Faster versions of these programs can be constructed by 
standard techniques, but there is nothing dramatic known to this 
writer. One can unroll loops and do careful scheduling of loads 
that the compiler may not do automatically. One can load the 
message string a halfword or a word at a time (with proper 
attention paid to alignment), to reduce the number of loads of 
the message and the number of exclusive or's of crc with the 
message (see exercise 2). The table lookup method can process
message bytes two at a time using a table of size 65536 words. 
This might make the program run faster or slower, depending on 
the size of the data cache and the penalty for a miss. 
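
The two-bytes-at-a-time variant might be sketched as follows. Several things here are assumptions made for illustration: the explicit length parameter (instead of a terminating zero byte), the names shift_bits, crc32_bitwise, and crc32_two_bytes, and the handling of an odd trailing byte one bit at a time. The bit-at-a-time routine is included only as a reference to compare against.

```c
#include <stddef.h>

static unsigned int table16[65536];

/* The bit-at-a-time step of the right-shifting algorithm, n times. */
static unsigned int shift_bits(unsigned int crc, int n) {
   unsigned int mask;

   while (n-- > 0) {
      mask = -(crc & 1);
      crc = (crc >> 1) ^ (0xEDB88320 & mask);
   }
   return crc;
}

unsigned int crc32_bitwise(const unsigned char *message, size_t len) {
   unsigned int crc = 0xFFFFFFFF;
   size_t i;

   for (i = 0; i < len; i++)
      crc = shift_bits(crc ^ message[i], 8);
   return ~crc;
}

unsigned int crc32_two_bytes(const unsigned char *message, size_t len) {
   unsigned int crc = 0xFFFFFFFF, v;
   size_t i;

   if (table16[1] == 0)                     // Set up the table on first use.
      for (v = 0; v <= 0xFFFF; v++)
         table16[v] = shift_bits(v, 16);

   for (i = 0; i + 1 < len; i += 2) {       // Two bytes per table lookup.
      crc = crc ^ message[i] ^ ((unsigned int)message[i + 1] << 8);
      crc = (crc >> 16) ^ table16[crc & 0xFFFF];
   }
   if (i < len)                             // Odd byte left over.
      crc = shift_bits(crc ^ message[i], 8);
   return ~crc;
}
```

The table occupies 256 KB, which is the cache consideration mentioned above.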



unsigned int crc32(unsigned char *message) {
   int i, j;
   unsigned int byte, crc, mask;
   static unsigned int table[256];

   /* Set up the table, if necessary. */

   if (table[1] == 0) {
      for (byte = 0; byte <= 255; byte++) {
         crc = byte;
         for (j = 7; j >= 0; j--) {    // Do eight times.
            mask = -(crc & 1);
            crc = (crc >> 1) ^ (0xEDB88320 & mask);
         }
         table[byte] = crc;
      }
   }

   /* Through with table setup, now calculate the CRC. */

   i = 0;
   crc = 0xFFFFFFFF;
   while ((byte = message[i]) != 0) {
      crc = (crc >> 8) ^ table[(crc ^ byte) & 0xFF];
      i = i + 1;
   }
   return ~crc;
}


FIGURE 14-7. Table lookup CRC algorithm. 


Exercises 


1. Show that if a generator G contains two or more terms, all 
single-bit errors are detected. 


2. Referring to Figure 14-7, show how to code the main loop 
so that the message data is loaded one word at a time. 
For simplicity, assume the message is full-word aligned 
and an integral number of words in length, before the 
zero byte that marks the end of the message. 


Chapter 15. Error-Correcting Codes 


15-1 Introduction 


This section is a brief introduction to the theory and practice of 
error-correcting codes (ECCs). We limit our attention to binary 
forward error-correcting (FEC) block codes. This means that the 
symbol alphabet consists of just two symbols (which we denote 
0 and 1), that the receiver can correct a transmission error
without asking the sender for more information or for a 
retransmission, and that the transmissions consist of a sequence 
of fixed length blocks, called code words. 


Section 15-2 describes the code independently discovered by 
R. W. Hamming and M. J. E. Golay before 1950 [Ham]. This 
code is single error-correcting (SEC), and a simple extension of 
it, also discovered by Hamming, is single error-correcting and, 
simultaneously, double error-detecting (SEC-DED). 


Section 15-4 steps back and asks what is possible in the area 
of forward error correction. Still sticking to binary FEC block 
codes, the basic question addressed is: for a given block length 
(or code length) and level of error detection and correction 
capability, how many different code words can be encoded? 


Section 15-2 is for readers who are primarily interested in 
learning the basics of how ECC works in computer memories. 
Section 15-4 is for those who are interested in the mathematics 
of the subject, and who might be interested in the challenge of 
an unsolved mathematical problem. 


The reader is cautioned that over the past 50 years ECC has 
become a very big subject. Many books have been published on 
it and closely related subjects [Hill, LC, MS, and Roman, to 
mention a few]. Here we just scratch the surface and introduce 
the reader to two important topics and to some of the 
terminology used in this field. Although much of the subject of 
error-correcting codes relies very heavily on the notations and 
results of linear algebra, and, in fact, is a very nice application of 
that abstract theory, we avoid it here for the benefit of those 
who are not familiar with that theory. 


The following notation is used throughout this chapter. The
terms are defined in subsequent sections.


m   Number of “information” or “message” bits
k   Number of parity-check bits (“check bits,” for short)
n   Code length, n = m + k
u   Information bit vector, $u_0, u_1, \ldots, u_{m-1}$
p   Parity check bit vector, $p_0, p_1, \ldots, p_{k-1}$
s   Syndrome vector, $s_0, s_1, \ldots, s_{k-1}$


15-2 The Hamming Code 


Hamming’s development [Ham] is a very direct construction of a 
code that permits correcting single-bit errors. He assumes that 
the data to be transmitted consists of a certain number of 
information bits u, and he adds to these a number of check bits p, 
such that if a block is received that has at most one bit in error, 
then p identifies the bit that is in error (which might be one of 
the check bits). Specifically, in Hamming’s code, p is interpreted 
as an integer that is 0 if no error occurred, and otherwise is the
1-origin index of the bit that is in error. Let m be the number of 
information bits, and k the number of check bits used. Because 
the k check bits must check themselves as well as the 
information bits, the value of p, interpreted as an integer, must 
range from 0 to m + k, which is m + k + 1 distinct values. 
Because k bits can distinguish $2^k$ cases, we must have


$$2^k \ge m + k + 1. \tag{1}$$


This is known as the Hamming rule. It applies to any single-error 
correcting (SEC) binary FEC block code in which all of the 
transmitted bits must be checked. The check bits will be 
interspersed among the information bits in a manner described 
below. 
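
The Hamming rule is easy to solve for the minimum k by simple search. The sketch below is illustrative only, and the name sec_check_bits is hypothetical.

```c
/* Smallest k satisfying the Hamming rule 2^k >= m + k + 1. */
int sec_check_bits(unsigned m) {
   int k = 0;

   while ((1ull << k) < (unsigned long long)m + k + 1)
      k++;
   return k;
}
```

For m = 4 it gives k = 3, and for m = 64 it gives k = 7; a SEC-DED code needs one more check bit than that.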


Because p indexes the bit (if any) that is in error, the least 
significant bit of p must be 1 if the erroneous bit is in an odd 
position, and 0 if it is in an even position or if there is no error. 
A simple way to achieve this is to let the least significant bit of
p, $p_0$, be an even parity check on the odd positions of the block
and to put $p_0$ in an odd position. The receiver then checks the
parity of the odd positions (including that of $p_0$). If the result is
1, an error has occurred in an odd position, and if the result is 0,
either no error occurred or an error occurred in an even
position. This satisfies the condition that p should be the index
of the erroneous bit, or be 0 if no error occurred.


Similarly, let the next-from-least significant bit of p, $p_1$, be an
even parity check of positions 2, 3, 6, 7, 10, 11, ... (in binary,
10, 11, 110, 111, 1010, 1011, ...), and put $p_1$ in one of these
positions. Those positions have a 1 in their second-from-least
significant binary position number. The receiver checks the
parity of these positions (including the position of $p_1$). If the
result is 1, an error occurred in one of those positions, and if the
result is 0, either no error occurred or an error occurred in some
other position.


Continuing, the third-from-least significant check bit, $p_2$, is
made an even parity check on those positions that have a 1 in
their third-from-least significant position number, namely
positions 4, 5, 6, 7, 12, 13, 14, 15, 20, ..., and $p_2$ is put in one of
those positions.


Putting the check bits in power-of-two positions (1, 2, 4, 8, 
...) has the advantage that they are independent. That is, the 
sender can compute $p_0$ independent of $p_1$, $p_2$, ... and, more
generally, it can compute each check bit independent of the 
others. 


As an example, let us develop a single error-correcting code 
for m = 4. Solving (1) for k gives k = 3, with equality holding.
This means that all $2^k$ possible values of the k check bits are
used, so it is particularly efficient. A code with this property is 
called a perfect code.1 


This code is called the (7, 4) Hamming code, which signifies 
that the code length is 7 and the number of information bits is 4. 
The positions of the check bits $p_i$ and the information bits $u_i$ are
shown here. 


Bit:        $p_0$  $p_1$  $u_3$  $p_2$  $u_2$  $u_1$  $u_0$
Position:   1    2    3    4    5    6    7


Table 15-1 shows the entire code. The 16 rows show all 16
possible information bit configurations and the check bits
calculated by Hamming’s method.


To illustrate how the receiver corrects a single-bit error, 
suppose the code word 


1001110 


is received. This is row 4 in Table 15-1 with bit 6 flipped. The 
receiver calculates the exclusive or of the bits in odd positions 
and gets 0. It calculates the exclusive or of bits 2, 3, 6, and 7 and 
gets 1. Lastly, it calculates the exclusive or of bits 4, 5, 6, and 7 
and gets 1. Thus the error indicator, which is called the 
syndrome, is binary 110, or 6. The receiver flips the bit at 
position 6 to correct the block. 


TABLE 15-1. THE (7,4) HAMMING CODE 


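Position:      1   2   3   4   5   6   7
Information    p0  p1  u3  p2  u2  u1  u0
     0         0   0   0   0   0   0   0
     1         1   1   0   1   0   0   1
     2         0   1   0   1   0   1   0
     3         1   0   0   0   0   1   1
     4         1   0   0   1   1   0   0
     5         0   1   0   0   1   0   1
     6         1   1   0   0   1   1   0
     7         0   0   0   1   1   1   1
     8         1   1   1   0   0   0   0
     9         0   0   1   1   0   0   1
    10         1   0   1   1   0   1   0
    11         0   1   1   0   0   1   1
    12         0   1   1   1   1   0   0
    13         1   0   1   0   1   0   1
    14         0   0   1   0   1   1   0
    15         1   1   1   1   1   1   1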
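
The (7,4) code can be sketched in a few lines. In this illustration a code word is held in the low-order seven bits of an integer with position 1 leftmost, so that the integer reads the same as the bit strings in the text; that layout and the names bit_at, hamming74_encode, hamming74_syndrome, and hamming74_correct are assumptions.

```c
/* Bit at 1-origin position pos of a 7-bit code word w, where
   position 1 is the leftmost of the seven bits. */
static unsigned bit_at(unsigned w, int pos) {
   return (w >> (7 - pos)) & 1;
}

/* Encode information bits u3 u2 u1 u0 (u3 most significant)
   into the code word p0 p1 u3 p2 u2 u1 u0. */
unsigned hamming74_encode(unsigned info) {
   unsigned u3 = (info >> 3) & 1, u2 = (info >> 2) & 1;
   unsigned u1 = (info >> 1) & 1, u0 = info & 1;
   unsigned p0 = u3 ^ u2 ^ u0;          // Checks positions 3, 5, 7.
   unsigned p1 = u3 ^ u1 ^ u0;          // Checks positions 3, 6, 7.
   unsigned p2 = u2 ^ u1 ^ u0;          // Checks positions 5, 6, 7.

   return (p0 << 6) | (p1 << 5) | (u3 << 4) | (p2 << 3) |
          (u2 << 2) | (u1 << 1) | u0;
}

/* Syndrome: 0 if all three parity checks pass; otherwise the
   1-origin position of the single bit in error. */
unsigned hamming74_syndrome(unsigned w) {
   unsigned s0 = bit_at(w, 1) ^ bit_at(w, 3) ^ bit_at(w, 5) ^ bit_at(w, 7);
   unsigned s1 = bit_at(w, 2) ^ bit_at(w, 3) ^ bit_at(w, 6) ^ bit_at(w, 7);
   unsigned s2 = bit_at(w, 4) ^ bit_at(w, 5) ^ bit_at(w, 6) ^ bit_at(w, 7);

   return (s2 << 2) | (s1 << 1) | s0;
}

/* Correct at most one erroneous bit. */
unsigned hamming74_correct(unsigned w) {
   unsigned s = hamming74_syndrome(w);

   if (s != 0)
      w = w ^ (1u << (7 - s));          // Flip the bit at position s.
   return w;
}
```

Encoding information value 4 gives 1001100; the received word 1001110 has syndrome 6 and is corrected back to 1001100.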


A SEC-DED Code


For many applications, a single error-correcting code would be 
considered unsatisfactory, because it accepts all blocks received. 
A SEC-DED code seems safer, and it is the level of correction and 
detection most often used in computer memories. 


The Hamming code can be converted to a SEC-DED code by 
adding one check bit, which is a parity bit (let us assume even 
parity) on all the bits in the SEC code word. This code is called 
an extended Hamming code [Hill, MS]. It is not obvious that it is 
SEC-DED. To see that it is, consider Table 15-2. It is assumed a
priori that either 0, 1, or 2 transmission errors occur. As
indicated in Table 15-2, if there are no errors, the overall parity
(the parity of the entire n-bit received code word) will be even,
and the syndrome of the (n - 1)-bit SEC portion of the block
will be 0. If there is one error, then the overall parity of the 
received block will be odd. If the error occurred in the overall 
parity bit, then the syndrome will be 0. If the error occurred in 
some other bit, then the syndrome will be nonzero and it will 
indicate which bit is in error. If there are two errors, then the 
overall parity of the received block will be even. If one of the 
two errors is in the overall parity bit, then the other is in the 
SEC portion of the block. In this case, the syndrome will be 
nonzero (and will indicate the bit in the SEC portion that is in 
error). If the errors are both in the SEC portion of the block, 
then the syndrome will also be nonzero, although the reason is a 
bit hard to explain. 


TABLE 15-2. ADDING A PARITY BIT TO MAKE A SEC-DED CODE 


Errors | Overall Parity | Syndrome | Receiver Conclusion
0      | even           | 0        | No error.
1      | odd            | 0        | Overall parity bit is in error.
1      | odd            | ≠ 0      | Syndrome indicates the bit in error.
2      | even           | ≠ 0      | Double error (not correctable).


The reason is that there must be a check bit that checks one 
of the two bit positions, but not the other one. The parity of this 
check bit and the bits it checks will thus be odd, resulting in a 
nonzero syndrome. Why must there be a check bit that checks 
one of the erroneous bits but not the other one? To see this, first 
suppose one of the erroneous bits is in an even position and the 
other is in an odd position. Then, because one of the check bits 
(po) checks all the odd positions and none of the even positions, 
the parity of the bits at the odd positions will be odd, resulting 
in a nonzero syndrome. More generally, suppose the erroneous 
bits are in positions i and j (with i ≠ j). Then, because the
binary representations of i and j must differ in some bit position,
one of them has a 1 at that position and the other has a 0 at that
position. The check bit corresponding to this position in the
binary integers checks the bits at positions in the code word that
have a 1 in their position number, but not the positions that
have a 0 in their position number. The bits covered by that
check bit will have odd parity, and thus the syndrome will be
nonzero. As an example, suppose the erroneous bits are in
positions 3 and 7. In binary, the position numbers are 0...0011
and 0...0111. These numbers differ in the third position from the
right, and at that position the number 7 has a 1 and the number
3 has a 0. Therefore, the bits checked by the third check bit
(these are bits 4, 5, 6, 7, 12, 13, 14, 15, ...) will have odd parity.


Thus, referring to Table 15-2, the overall parity and the 
syndrome together uniquely identify whether 0, 1, or 2 errors 
occurred. In the case of one error, the receiver can correct it. In 
the case of two errors, the receiver cannot tell whether just one 
of the errors is in the SEC portion (in which case it could correct 
it) or both errors are in the SEC portion (in which case an 
attempt to correct it would result in incorrect information bits). 
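

The receiver's decision in Table 15-2 can be sketched as a small C
function. This is only an illustration: the enum and function names
are invented for the example, and overall_parity is assumed to be the
parity (0 for even, 1 for odd) of the whole received word, including
the overall parity bit.

```c
/* Hypothetical names; a sketch of the decision table in Table 15-2. */
enum ecc_result {
   ECC_NO_ERROR,          /* 0 errors                              */
   ECC_PARITY_BIT_ERROR,  /* 1 error, in the overall parity bit    */
   ECC_SINGLE_ERROR,      /* 1 error, located by the syndrome      */
   ECC_DOUBLE_ERROR       /* 2 errors, not correctable             */
};

enum ecc_result classify(unsigned overall_parity, unsigned syndrome) {
   if (overall_parity == 0)      /* Even parity: 0 or 2 errors. */
      return syndrome == 0 ? ECC_NO_ERROR : ECC_DOUBLE_ERROR;
   /* Odd parity: exactly one error is presumed. */
   return syndrome == 0 ? ECC_PARITY_BIT_ERROR : ECC_SINGLE_ERROR;
}
```

The function only classifies; correcting the bit named by the
syndrome is a separate step.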


The overall parity bit could as well be a parity check on only 
the even positions, because the overall parity bit is easily 
calculated from that and the parity of the odd positions (which 
is the least significant check bit). More generally, the overall 
parity bit could as well be a parity check on the complement set 
of bits checked by any one of the SEC parity bits. This 
observation might save some gates in hardware. 


It should be clear that the Hamming SEC code has minimum 
redundancy. That is, for a given number of information bits, it 
adds a minimum number of check bits that permit single error 
correction. This is so because by construction, just enough check 
bits are added so that when interpreted as an integer, they can 
index any bit in the code, with one state left over to denote “no 
errors.” In other words, the code satisfies inequality (1). 
Hamming shows that the SEC-DED code constructed from a SEC 
code by adding one overall parity bit is also of minimum 
redundancy. His argument is to assume that a SEC-DED code 
exists that has fewer check bits, and he derives from this a 
contradiction to the fact that the starting SEC code had 
minimum redundancy. 


Minimum Number of Check Bits Required 


The middle column of Table 15-3 shows minimal solutions of
inequality (1) for a range of values of m. The rightmost column
simply shows that one more bit is required for a SEC-DED code.
From this table one can see, for example, that to provide the 
SEC-DED level ECC for a memory word containing 64 
information bits, eight check bits are required, giving a total 
memory word size of 72 bits. 
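

The minimal k can also be found by direct search. The sketch below
assumes that inequality (1) has the form 2^k ≥ m + k + 1 (enough
check-bit states to index every bit of the code word, plus one state
for "no errors"); the function names are made up for this example.

```c
/* Smallest k such that 2^k >= m + k + 1, for m information bits. */
unsigned sec_check_bits(unsigned m) {
   unsigned k = 0;
   while ((1ULL << k) < (unsigned long long)m + k + 1)
      k++;
   return k;
}

/* SEC-DED needs one more bit (the overall parity bit). */
unsigned secded_check_bits(unsigned m) {
   return sec_check_bits(m) + 1;
}
```

For m = 64 this gives seven check bits for SEC and eight for SEC-DED,
in agreement with the 72-bit memory word mentioned above.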


TABLE 15-3. EXTRA BITS FOR ERROR CORRECTION/DETECTION


Number of
Information    k for    k for
Bits m         SEC      SEC-DED

1              2        3
2 to 4         3        4
5 to 11        4        5
12 to 26       5        6
27 to 57       6        7
58 to 120      7        8
121 to 247     8        9
248 to 502     9        10


Concluding Remarks 


In the more mathematically oriented ECC literature, the term 
“Hamming code” is reserved for the perfect codes described 
above, that is, those with (n, m) = (3, 1), (7, 4), (15, 11), (31,
26), and so on. Similarly, the extended Hamming codes are the 
perfect SEC-DED codes described above. Computer architects 
and engineers often use the term to denote any of the codes that 
Hamming described, and some variations. The term “extended” 
is often understood. 


The first IBM computer to use Hamming codes was the IBM 
Stretch computer (model 7030), built in 1961 [LC]. It used a 
(72, 64) SEC-DED code (not a perfect code). A follow-on 
machine known as Harvest (model 7950), built in 1962, was 
equipped with 22-track tape drives that employed a (22, 16) 
SEC-DED code. The ECCs found on modern machines are usually
not Hamming codes, but rather are codes devised for some
logical or electrical property, such as minimizing the depth of
the parity check trees, and making them all the same length.
Such codes give up Hamming's simple method of determining 
which bit is in error, and instead use a hardware table lookup. 


At the time of this writing (2012), most notebook PCs 
(personal computers) have no error checking in their memory 
systems. Desktop PCs may have none, or they may have a simple 
parity check. Server-class computers generally have ECC at the 
SEC-DED level. 


In the early solid-state computers equipped with ECC 
memory, the memory was usually in the form of eight check bits 
and 64 information bits. A memory module (group of chips) 
might be built from, typically, nine 8-bit-wide chips. A word 
access (72 bits, including check bits) fetches eight bits from each 
of these nine chips. Each chip is laid out in such a way that the 
eight bits accessed for a single word are physically far apart. 
Thus, a word access references 72 bits that are physically 
somewhat separated. With bits interleaved in that way, if a few 
close-together bits in the same chip are altered, as, for example, 
by an alpha particle or cosmic ray hit, a few words will have 
single-bit errors, which can be corrected. Some larger memories 
incorporate a technology known as Chipkill. This allows the
computer to continue to function even if an entire memory chip 
fails, for example, due to loss of power to the chip. 


The interleaving technique can be used in communication 
applications to correct burst errors by interleaving the bits in 
time. 


Today the organization of ECC memories is often more 
complicated than simply having eight check bits and 64 
information bits. Modern server memories might have 16 or 32 
information bytes (128 or 256 bits) checked as a single ECC 
word. Each DRAM chip may store two, three, or four bits in 
physically adjacent positions. Correspondingly, ECC is done on 
alphabets of four, eight, or 16 characters—a subject not 
discussed here. Because the DRAM chips usually come in 8- or 
16-bit-wide configurations, the memory module often provides 
more than enough bits for the ECC function. The extra bits might 
be used for other functions, such as one or two parity bits on the 
memory address. This allows the memory to check that the
address it receives is (probably) the address that the CPU
generated.


In modern server-class machines, ECC might be used in 
different levels of cache memory, as well as in main memory. It 
might also be used in non-memory areas, such as on busses. 


15-3 Software for SEC-DED on 32 Information Bits 


This section describes a code for which encoding and decoding 
can be efficiently implemented in software for a basic RISC. It 
does single error correction and double error detection on 32 
information bits. The technique is basically Hamming's. 


We follow Hamming in using check bits in such a way that 
the receiver can easily (in software) determine whether zero, 
one, or two errors occurred, and if one error occurred it can 
easily correct it. We also follow Hamming in using a single 
overall parity bit to convert a SEC code to SEC-DED, and we 
assume the check bit values are chosen to make even parity on 
the check bit and the bits it checks. A total of seven check bits 
are required (Table 15-3). 


Consider first just the SEC property, without DED. For SEC, 
six check bits are required. For implementation in software, the 
main difficulty with Hamming's method is that it merges the six 
check bits with the 32 information bits, resulting in a 38-bit 
quantity. We are assuming the implementation is done on a 32- 
bit machine, and the information bits are in a 32-bit word. It 
would be very awkward for the sender to spread out the 
information bits over a 38-bit quantity and calculate the check 
bits into the positions described by Hamming. The receiver 
would have similar difficulties. The check bits could be moved 
into a separate word or register, with the 32 information bits 
kept in another word or register. But this gives an irregular 
range of positions that are checked by each check bit. In the 
scheme to be described, these ranges retain most of the 
regularity that they have in Hamming's scheme (which ignores 
word boundaries). The regularity leads to simplified 
calculations. 


The positions checked by each check bit are shown in Table 
15-4. In this table, bits are numbered in the usual little-endian 
way, with position 0 being the least significant bit (unlike 
Hamming's numbering). 


TABLE 15-4. POSITIONS CHECKED BY THE CHECK BITS


Check Bit    Positions Checked

p0           0, 1, 3, 5, 7, 9, 11, ..., 29, 31
p1           0, 2-3, 6-7, 10-11, ..., 30-31
p2           0, 4-7, 12-15, 20-23, 28-31
p3           0, 8-15, 24-31
p4           0, 16-31
p5           1-31


Observe that each of the 32 information word bit positions is 
checked by at least two check bits. For example, position 6 is 
checked by p1 and p2 (and also by p5). Thus, if two information
words differ in one bit position, the code words (information 
plus check bits) differ in at least three positions (the information 
bit that was corrupted and two or more check bits), so the code 
words are at a distance of at least three from one another (see 
“Hamming Distance” on page 343). Furthermore, if two
information words differ in two bit positions, then at least one of
p0 through p5 checks one of the positions, but not the other, so again
the code words will be at least a distance of three apart. 
Therefore, the above scheme represents a code with minimum 
distance three (a SEC code). 


Suppose a code word is transmitted to a receiver. Let u 
denote the information bits received, p denote the check bits 
received, and s (for syndrome) denote the exclusive or of p and 
the check bits calculated from u by the receiver. Then, 
examination of Table 15-4 reveals that s will be set as shown in 
Table 15-5, for zero or one errors in the code word. 


TABLE 15-5. SYNDROME FOR ZERO OR ONE ERRORS


                Resulting Syndrome
Error in Bit    s5 ... s0

(no errors)     000000
u0              011111
u1              100001
u2              100010
u3              100011
u4              100100
...             ...
u30             111110
u31             111111
p0              000001
p1              000010
p2              000100
p3              001000
p4              010000
p5              100000


As an example, suppose information bit u4 is corrupted in
transmission. Table 15-4 shows that u4 is checked by check bits
p2 and p5. Therefore, the check bits calculated by the sender and
receiver will differ in p2 and p5. In this scenario the check bits
received are the same as those transmitted, so the syndrome will
have bits 2 and 5 set; that is, it will be 100100.


If one of the check bits is corrupted in transmission (and no 
errors occur in the information bits) then the check bits 
received and those calculated by the receiver (which equal those 
calculated by the sender) differ in the check bit that was 
corrupted, and in no other bits, as shown in the last six rows of 
Table 15-5. 


The syndromes shown in Table 15-5 are distinct for all 39 
possibilities of no error or a single-bit error anywhere in the 
code word. Therefore, the syndrome identifies whether or not an
error occurred, and if so, which bit position is in error.
Furthermore, if a single-bit error occurred, it is fairly easy to 
calculate which bit is in error (without resorting to a table 
lookup) and to correct it. Here is the logic: 


If s = 0, no error occurred.

If s = 011111, u0 is in error.

If s = 1xxxxx, with xxxxx nonzero, the error is in u at
position xxxxx.


Otherwise, a single bit in s is set, the error is in a check bit, 
and the correct check bits are given by the exclusive or of the 
syndrome and the received check bits (or by the calculated 
check bits). 


Under the assumption that an error in the check bits need not 
be corrected, this can be expressed as shown here, where b is the 
bit number to be corrected. 


if (s & (s − 1)) = 0 then ...   // No correction required.
else do
   if s = 0b011111 then b ← 0
   else b ← s & 0b011111
   u ← u ⊕ (1 << b)             // Complement bit b of u.
end


There is a hack that changes the second if-then-else 
construction shown above into an assignment statement. 
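

The assignment in question is the one used in Figure 15-2. As a
sketch, the two forms can be placed side by side in C; the function
names are invented for the example, and s is assumed to be a six-bit
syndrome that has more than one bit set.

```c
/* Branching form: the if-then-else construction. */
unsigned bit_branch(unsigned s) {
   if (s == 0x1F) return 0;     /* s = 011111: bit 0 is in error. */
   return s & 0x1F;             /* s = 1xxxxx: bit xxxxx.         */
}

/* Branch-free form: a single assignment. */
unsigned bit_hack(unsigned s) {
   return s - 31 - (s >> 5);
}
```

For s = 011111 the shift contributes 0 and the result is 31 − 31 = 0.
For s = 1xxxxx the shift contributes 1, so the result is
(32 + xxxxx) − 31 − 1 = xxxxx.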


To recognize double-bit errors, an overall parity bit is 
computed (parity of u31:0 and p5:0), and put in bit position 6 of
p for transmission. Double-bit errors are distinguished by the
overall parity being correct, but with the syndrome (s5:0) being
nonzero. The reason the syndrome is nonzero is the same as in 
the case of the extended Hamming code, given on page 334. 


Software that implements this code is shown in Figures 15-1 
and 15-2. We assume the simple case of a sender and a receiver, 
and the receiver has no need to correct an error that occurs in 
the check bits or in the overall parity bit. 




unsigned int checkbits(unsigned int u) {

   /* Computes the six parity check bits for the
   "information" bits given in the 32-bit word u. The
   check bits are p[5:0]. On sending, an overall parity
   bit will be prepended to p (by another process).

   Bit   Checks these bits of u
   p[0]  0, 1, 3, 5, ..., 31 (0 and the odd positions).
   p[1]  0, 2-3, 6-7, ..., 30-31 (0 and positions xxx1x).
   p[2]  0, 4-7, 12-15, 20-23, 28-31 (0 and posns xx1xx).
   p[3]  0, 8-15, 24-31 (0 and positions x1xxx).
   p[4]  0, 16-31 (0 and positions 1xxxx).
   p[5]  1-31 */

   unsigned int p0, p1, p2, p3, p4, p5, p6, p;
   unsigned int t1, t2, t3;

   // First calculate p[5:0] ignoring u[0].

   p0 = u ^ (u >> 2);
   p0 = p0 ^ (p0 >> 4);
   p0 = p0 ^ (p0 >> 8);
   p0 = p0 ^ (p0 >> 16);        // p0 is in posn 1.

   t1 = u ^ (u >> 1);
   p1 = t1 ^ (t1 >> 4);
   p1 = p1 ^ (p1 >> 8);
   p1 = p1 ^ (p1 >> 16);        // p1 is in posn 2.

   t2 = t1 ^ (t1 >> 2);
   p2 = t2 ^ (t2 >> 8);
   p2 = p2 ^ (p2 >> 16);        // p2 is in posn 4.

   t3 = t2 ^ (t2 >> 4);
   p3 = t3 ^ (t3 >> 16);        // p3 is in posn 8.

   p4 = t3 ^ (t3 >> 8);         // p4 is in posn 16.

   p5 = p4 ^ (p4 >> 16);        // p5 is in posn 0.

   p = ((p0>>1) & 1) | ((p1>>1) & 2) | ((p2>>2) & 4) |
       ((p3>>5) & 8) | ((p4>>12) & 16) | ((p5 & 1) << 5);

   p = p ^ (-(u & 1) & 0x3F);   // Now account for u[0].

   return p;
}


FIGURE 15-1. Calculation of check bits. 




int correct(unsigned int pr, unsigned int *ur) {

   /* This function looks at the received seven check
   bits and 32 information bits (pr and ur) and
   determines how many errors occurred (under the
   presumption that it must be 0, 1, or 2). It returns
   with 0, 1, or 2, meaning that no errors, one error, or
   two errors occurred. It corrects the information word
   received (ur) if there was one error in it. */

   unsigned int po, p, syn, b;

   po = parity(pr ^ *ur);       // Compute overall parity
                                // of the received data.
   p = checkbits(*ur);          // Calculate check bits
                                // for the received info.
   syn = p ^ (pr & 0x3F);       // Syndrome (exclusive of
                                // overall parity bit).
   if (po == 0) {
      if (syn == 0) return 0;   // If no errors, return 0.
      else return 2;            // Two errors, return 2.
   }
                                // One error occurred.
   if (((syn - 1) & syn) == 0)  // If syn has zero or one
      return 1;                 // bits set, then the
                                // error is in the check
                                // bits or the overall
                                // parity bit (no
                                // correction required).

   // One error, and syn bits 5:0 tell where it is in ur.

   b = syn - 31 - (syn >> 5);   // Map syn to range 0 to 31.
// if (syn == 0x1F) b = 0;      // (These two lines equiv.
// else b = syn & 0x1F;         // to the one line above.)
   *ur = *ur ^ (1 << b);        // Correct the bit.
   return 1;
}


FIGURE 15-2. The receiver's actions. 
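

To try the scheme end to end, the sender's side and the parity
function that correct relies on must be supplied. The following is a
minimal sketch, not the book's code: parity and encode are
hypothetical helpers, checkbits is a compact mask-based stand-in for
Figure 15-1 (the masks are read off Table 15-4), and unsigned is
assumed to be 32 bits wide.

```c
/* Parity of a 32-bit word: 1 if the number of 1-bits is odd. */
unsigned parity(unsigned x) {
   x ^= x >> 16;  x ^= x >> 8;  x ^= x >> 4;
   x ^= x >> 2;   x ^= x >> 1;
   return x & 1;
}

/* Stand-in for checkbits: one mask per check bit p0..p5. */
unsigned checkbits(unsigned u) {
   static const unsigned mask[6] = {
      0xAAAAAAABu, 0xCCCCCCCDu, 0xF0F0F0F1u,
      0xFF00FF01u, 0xFFFF0001u, 0xFFFFFFFEu };
   unsigned p = 0;
   for (int i = 0; i < 6; i++)
      p |= parity(u & mask[i]) << i;
   return p;
}

/* Sender: check bits in p[5:0], overall parity in bit 6,
   chosen so that u and all seven check bits have even parity. */
unsigned encode(unsigned u) {
   unsigned p = checkbits(u);
   return p | (parity(u ^ p) << 6);
}

/* Receiver, following the logic of Figure 15-2. */
int correct(unsigned pr, unsigned *ur) {
   unsigned po = parity(pr ^ *ur);
   unsigned syn = checkbits(*ur) ^ (pr & 0x3F);
   if (po == 0) return syn == 0 ? 0 : 2;
   if (((syn - 1) & syn) == 0) return 1;   /* Error not in u. */
   *ur ^= 1u << (syn - 31 - (syn >> 5));   /* Correct the bit. */
   return 1;
}
```

A sender would transmit u together with encode(u); the receiver calls
correct on what arrives and examines the return value.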


To compute the check bits, function checkbits first ignores
information bit u0 and computes

0. x ← u ⊕ (u >> 1)
1. x ← x ⊕ (x >> 2)
2. x ← x ⊕ (x >> 4)
3. x ← x ⊕ (x >> 8)
4. x ← x ⊕ (x >> 16)

except omitting line i when computing check bit pi, for 0 ≤ i ≤
4. This puts pi in various positions of word x, as shown in Figure
15-1. For p5, all the above assignments are used. This is where
the regularity of the pattern of bits checked by each check bit
pays off; a lot of code commoning can be done. This reduces
what would be 4×5 + 5 = 25 such assignments to 15, as
shown in Figure 15-1.


Incidentally, if the computer has an instruction for computing 
the parity of a word, or has the population count instruction 
(which puts the word parity in the least significant bit of the 
target register), then the regular pattern is not needed. On such 
a machine, the check bits might be computed as 


p0 = pop(u & 0xAAAAAAAB) & 1;
p1 = pop(u & 0xCCCCCCCD) & 1;
and so forth. 
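

Written out in full, the population-count approach might look like
the sketch below. Only the first two masks appear in the text; the
other four are derived here from Table 15-4 and should be read as an
assumption. The pop function is a portable software substitute for
the machine instruction, the name checkbits_pop is invented for the
example, and unsigned is assumed to be 32 bits wide.

```c
/* Population count (number of 1-bits) of a 32-bit word. */
unsigned pop(unsigned x) {
   x = x - ((x >> 1) & 0x55555555u);
   x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
   x = (x + (x >> 4)) & 0x0F0F0F0Fu;
   return (x * 0x01010101u) >> 24;
}

/* Check bits p[5:0], one masked parity per check bit. */
unsigned checkbits_pop(unsigned u) {
   unsigned p0 = pop(u & 0xAAAAAAABu) & 1;  /* 0 and odd positions */
   unsigned p1 = pop(u & 0xCCCCCCCDu) & 1;  /* 0, 2-3, 6-7, ...    */
   unsigned p2 = pop(u & 0xF0F0F0F1u) & 1;  /* 0, 4-7, 12-15, ...  */
   unsigned p3 = pop(u & 0xFF00FF01u) & 1;  /* 0, 8-15, 24-31      */
   unsigned p4 = pop(u & 0xFFFF0001u) & 1;  /* 0, 16-31            */
   unsigned p5 = pop(u & 0xFFFFFFFEu) & 1;  /* 1-31                */
   return p0 | (p1 << 1) | (p2 << 2) | (p3 << 3) | (p4 << 4) | (p5 << 5);
}
```

With a hardware population count this is six AND operations, six
counts, and some shifting and ORing, with no dependence on the
regularity of the masks.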


After packing the six check bits into a single quantity p, the
checkbits function accounts for information bit u0 by
complementing all six check bits if u0 = 1. (See Table 15-4; p5
must be complemented because u0 was erroneously included in
the calculation of p5 up to this point.)


15-4 Error Correction Considered More Generally 


This section continues to focus on the binary FEC block codes, 
but a little more generally than the codes described in Section 
15-2. We drop the assumption that the block consists of a set of 
“information” bits and a distinct set of “check” bits, and any 
implication that the number of code words must be a power of 


2. We also consider levels of error correction and detection 
capability greater than SEC and SEC-DED. For example, suppose 
you want a double error-correcting code for a binary 
representation of decimal digits. If the code has 16 code words 
(with ten being used to represent the decimal digits and six 
being unused), the length of the code words must be at least 11 
bits. But if a code with only 10 code words is used, the code 
words can be of length 10 bits. (This is shown in Table 15-8 on 
page 351, in the column for d = 5, as is explained below.)


A code is simply a set of code words, and for our purposes the 
code words are binary strings all of the same length which, as 
mentioned above, is called the code length. The number of code 
words in the set is called the code size. We make no 
interpretation of the code words; they might represent 
alphanumeric characters or pixel values in a picture, for 
example. 


As a trivial example, a code might consist of the binary 
integers from 0 to 7, with each bit repeated three times: 


{000000000, 000000111, 000111000, 000111111, 111000000,
..., 111111111}.


Another example is the two-out-of-five code, in which each 
code word has exactly two 1-bits: 


{00011, 00101, 00110, 01001, 01010, 01100, 10001, 10010,
10100, 11000}.


The code size is 10, and thus it is suitable for representing 
decimal digits. Notice that if code word 00110 is considered to 
represent decimal 0, then the remaining values can be decoded 
into digits 1 through 9 by giving the bits weights of 6, 3, 2, 1, 
and 0, in left-to-right order. 
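

The decoding rule just described is easily programmed. In this
sketch the function name and the convention of returning −1 for an
invalid word are choices made for the example.

```c
/* Decodes a 5-bit two-out-of-five code word. Returns the decimal
   digit 0 to 9, or -1 if w does not have exactly two 1-bits. */
int decode_2of5(unsigned w) {
   /* Weights for bit 0 (rightmost) up to bit 4 (leftmost). */
   static const int weight[5] = {0, 1, 2, 3, 6};
   int ones = 0, sum = 0;

   if (w > 0x1F) return -1;
   for (int i = 0; i < 5; i++) {
      if (w & (1u << i)) {
         ones++;
         sum += weight[i];
      }
   }
   if (ones != 2) return -1;
   return (w == 0x06) ? 0 : sum;   /* 00110 is taken to mean 0. */
}
```

The special case is needed because 00110 and 01001 both have weight
sum 3; assigning 00110 to the digit 0 resolves the clash.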


The code rate is a measure of the efficiency of a code. For a 
code like Hamming’s, this can be defined as the number of 
information bits divided by the code length. For the Hamming 
code discussed above, it is 4/7 ≈ 0.57. More generally, the code
rate is defined as the log base 2 of the code size, divided by the
code length. The simple codes above have rates of log2(8)/9 ≈
0.33 and log2(10)/5 ≈ 0.66, respectively.


Hamming Distance 


The central concept in the theory of ECC is that of Hamming 
distance. The Hamming distance between two words (of equal 
length) is the number of bit positions in which they differ. Put 
another way, it is the population count of the exclusive or of the 
two words. It is appropriate to call this a distance function 
because it satisfies the definition of a distance function used in 
linear algebra: 


d(x, y) = d(y, x),
d(x, y) ≥ 0,
d(x, y) = 0 iff x = y, and
d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality).
Here d(x, y) denotes the Hamming distance between code words 


x and y, which for brevity we will call simply the distance 
between x and y. 
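

For words that fit in a machine register, the definition translates
directly into code. This is a small sketch for 32-bit words; the
function names are chosen for the example, and pop is a simple
software population count.

```c
/* Population count by repeatedly clearing the rightmost 1-bit. */
unsigned pop(unsigned x) {
   unsigned n = 0;
   while (x != 0) {
      x &= x - 1;
      n++;
   }
   return n;
}

/* Hamming distance: the number of bit positions where x and y differ. */
unsigned hamming_distance(unsigned x, unsigned y) {
   return pop(x ^ y);
}
```

Symmetry and d(x, x) = 0 follow at once from the properties of
exclusive or.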


Suppose a code has a minimum distance of 1. That is, there 
are two words x and y in the set that differ in only one bit 
position. Clearly, if x were transmitted and the bit that makes it 
distinct from y were flipped due to a transmission error, then the 
receiver could not distinguish between receiving x with a certain 
bit in error and receiving y with no errors. Hence in such a code 
it is impossible to detect even a 1-bit error, in general. 


Suppose now that a code has a minimum distance of 2. Then 
if just one bit is flipped in transmission, an invalid code word is 
produced, and thus the receiver can (in principle) detect the 
error. If two bits are flipped, a valid code word might be 
transformed into another valid code word. Thus, double-bit 
errors cannot be detected. Furthermore, single-bit errors cannot 
be corrected. This is because if a received word has one bit in 
error, then there may be two code words that are one bit-change 
away from the received word, and the receiver has no basis for 
deciding which is the original code word. 


The code obtained by appending a single parity bit is in this 
category. It is shown below for the case of three information bits 
(m = 3). The rightmost bit is the parity bit, chosen to make even
parity on all four bits. The reader may verify that the minimum 
distance between code words is 2. 


0000 
0011 
0101 
0110 
1001 
1010 
1100 
1111 
Actually, adding a single parity bit permits detecting any odd 
number of errors, but when we say that a code permits detecting 
k-bit errors, we mean all errors up to k bits. 
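

The minimum distance of a small code can be checked by brute force,
comparing every pair of code words. The sketch below assumes each
code word fits in an unsigned word; the function names are
illustrative only.

```c
/* Number of 1-bits in x. */
unsigned pop(unsigned x) {
   unsigned n = 0;
   while (x != 0) {
      x &= x - 1;
      n++;
   }
   return n;
}

/* Minimum Hamming distance over all pairs of distinct code words.
   Returns 32 if the code has fewer than two words. */
unsigned min_distance(const unsigned *code, int size) {
   unsigned best = 32;
   for (int i = 0; i < size; i++) {
      for (int j = i + 1; j < size; j++) {
         unsigned d = pop(code[i] ^ code[j]);
         if (d < best) best = d;
      }
   }
   return best;
}
```

Applied to the eight code words listed above, min_distance returns 2.
The work grows as the square of the code size, so this is practical
only for small codes.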


Now consider the case in which the minimum distance 
between code words is 3. If any one or two bits is flipped in 
transmission, an invalid code word results. If just one bit is 
flipped, the receiver can (we imagine) try flipping each of the 
received bits one at a time, and in only one case will a code 
word result. Hence in such a code the receiver can detect and 
correct a single-bit error. A double-bit error might appear to be a 
single-bit error from another code word, and thus the receiver 
cannot detect double-bit errors. 


Similarly, it is easy to reason that if the minimum distance of 
a code is 4, the receiver can correct all single-bit errors and 
detect all double-bit errors (it is a SEC-DED code). As mentioned 
above, this is the level of capability often used in computer 
memories. 


Table 15-6 summarizes the error-correction and -detection 
capabilities of a block code based on its minimum distance. 


TABLE 15-6. NUMBER OF BITS CORRECTED/DETECTED


Minimum
Distance    Correct        Detect

1           0              0
2           0              1
3           1              1
4           1              2
5           2              2
6           2              3
7           3              3
8           3              4
d           ⌊(d − 1)/2⌋    ⌊d/2⌋


Error-correction capability can be traded for error detection. 
For example, if the minimum distance of a code is 3, that 
redundancy can be used to correct no errors but to detect single- 
or double-bit errors. If the minimum distance is 5, the code can 
be used to correct single-bit errors and detect 3-bit errors, or to 
correct no errors but to detect 4-bit errors, and so forth. 
Whatever is subtracted from the “Correct” column of Table 15-6
can be added to the “Detect” column. 


The Main Coding Theory Problem 


Up to this point we have asked, “Given a number of information 
bits m and a desired minimum distance d, how many check bits 
are required?" In the interest of generality, we will now turn this 
question around and ask, "For a given code length n and 
minimum distance d, how many code words are possible?" Thus, 
the number of code words need not be an integral power of 2. 


Following [Roman] and others, let A(n, d) denote the largest 
possible code size for a (binary) code with length n and 
minimum distance d. The remainder of this section is devoted to 
exploring some of what is known about this function. 
Determining its values has been called the main coding theory 
problem [Hill, Roman]. Throughout this section we assume that
n ≥ d ≥ 1.


It is nearly trivial that 


A(n, 1) = 2^n,                (2)

because there are 2^n distinct words of length n.


For minimum distance 2, we know from the single parity bit
example that A(n, 2) ≥ 2^(n−1). But A(n, 2) cannot exceed 2^(n−1)
for the following reason. Suppose there is a code of length n and
minimum distance 2 that has more than 2^(n−1) code words. Delete
any one column from the code words. (We envision the code
words as being arranged in a matrix much like that of Table 15-1
on page 333.) This produces a code of length n − 1 and
minimum distance at least 1 (deleting a column can reduce the
minimum distance by at most 1), and of size exceeding 2^(n−1).
Thus, it has A(n − 1, 1) > 2^(n−1), contradicting Equation (2).
Hence,


A(n, 2) = 2^(n−1).


That was not difficult. What about A(n, 3)? That is an 
unsolved problem, in the sense that no formula or reasonably 
easy means of calculating it is known. Of course, many specific 
values of A(n, 3) are known, and some bounds are known, but
the exact value is unknown in most cases. 


When equality holds in (1), it represents the solution to this
problem for the case d = 3. Letting n = m + k, (1) can be
rewritten

$$2^m \le \frac{2^n}{n + 1}.$$

Here, m is the number of information bits, so 2^m is the maximum
number of code words. Hence, we have

$$A(n, 3) \le \frac{2^n}{n + 1}, \qquad (3)$$

with equality holding when 2^n/(n + 1) is an integer (by
Hamming’s construction).


For n = 7, this gives A(7, 3) = 16, which we already know
from Section 15-2. For n = 3 it gives A(3, 3) ≤ 2, and the limit
of 2 can be realized with code words 000 and 111. For n = 4 it
gives A(4, 3) ≤ 3.2, and with a little doodling you will see that
it is not possible to get three code words of length 4 with d = 3.
Thus, when equality does not hold in (3), it merely gives an
upper bound, quite possibly not realizable, on the maximum
number of code words.


An interesting relation is that for n ≥ 2,


A(n, d) ≤ 2A(n − 1, d).                (4)


Therefore, adding 1 to the code length at most doubles the 
number of code words possible for the same minimum distance 
d. To see this, suppose you have a code of length n, distance d, 
and size A(n, d). Choose an arbitrary column of the code. Either 
half or more of the code words have a 0 in the selected column, 
or half or more have a 1 in that position. Of these two subsets, 
choose one that has at least A(n, d)/2 code words, form a new 
code consisting of this subset, and delete the selected column 
(which is either all 0's or all 1's). The resulting set of code words
has n reduced by 1, has the same distance d, and has at least
A(n, d)/2 code words. Thus, A(n − 1, d) ≥ A(n, d)/2, from which
inequality (4) follows. 


A useful relation is that if d is even, then


A(n, d) = A(n − 1, d − 1).                (5)


To see this, suppose you have a code C of length n and minimum 
distance d, with d odd. Form a new code by appending to each 
word of C a parity bit, let us say to make the parity of each word 
even. The new code has length n + 1 and has the same number 
of code words as does C. It has minimum distance d + 1. For if 
two words of C are a distance x apart, with x odd, then one 
word must have even parity and the other must have odd parity. 
Thus, we append a 0 in the first case and a 1 in the second case, 
which increases the distance between the words to x + 1. If x is 
even, we append a 0 to both words, which does not change the 
distance between them. Because d is odd, all pairs of words that 
are a distance d apart become distance d + 1 apart. The distance 
between two words more than d apart either does not change or 
increases. Therefore the new code has minimum distance d + 1. 
This shows that if d is odd, then A(n + 1, d + 1) ≥ A(n, d), or,
equivalently, A(n, d) ≥ A(n − 1, d − 1) for even d ≥ 2.

Now suppose you have a code of length n and minimum
distance d ≥ 2 (d can be odd or even). Form a new code by
eliminating any one column. The new code has length n − 1,
minimum distance at least d − 1, and is the same size as the
original code (all the code words of the new code are distinct
because the new code has minimum distance at least 1).
Therefore A(n − 1, d − 1) ≥ A(n, d). This establishes Equation
(5). 
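

The parity-bit half of this argument can be watched in action on a
tiny code. The sketch below uses an illustrative code of length 5
with minimum distance 3 (the four code words are chosen for the
example, not taken from the text) and appends an even-parity bit on
the right of each word. All function names are hypothetical.

```c
/* Number of 1-bits in x. */
unsigned pop(unsigned x) {
   unsigned n = 0;
   while (x != 0) {
      x &= x - 1;
      n++;
   }
   return n;
}

/* Appends a parity bit on the right, making the parity even. */
unsigned append_parity(unsigned w) {
   return (w << 1) | (pop(w) & 1);
}

/* Minimum Hamming distance over all pairs of code words. */
unsigned min_distance(const unsigned *code, int size) {
   unsigned best = 32;
   for (int i = 0; i < size; i++)
      for (int j = i + 1; j < size; j++) {
         unsigned d = pop(code[i] ^ code[j]);
         if (d < best) best = d;
      }
   return best;
}

/* Builds the extended code: out[i] is in[i] with its parity bit. */
void extend_code(const unsigned *in, unsigned *out, int size) {
   for (int i = 0; i < size; i++)
      out[i] = append_parity(in[i]);
}
```

For the code {00000, 00111, 11001, 11110}, min_distance gives 3
before extension and 4 after, with the number of code words
unchanged.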


Spheres 


Upper and lower bounds on A(n, d), for any d ≥ 1, can be
derived by thinking in terms of n-dimensional spheres. Given a 
code word, think of it as being at the center of a “sphere” of 
radius r, consisting of all words at a Hamming distance r or less 
from it. 


How many points (words) are in a sphere of radius r? First, 
consider how many points are in the shell at distance exactly r 
from the central code word. This is given by the number of ways 
to choose r different items from n, ignoring the order of choice. 
We imagine the r chosen bits as being complemented to form a 
word at distance exactly r from the central point. This “choice”
function, often written $\binom{n}{r}$, can be calculated from²

$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}.$$

Thus, $\binom{n}{0} = 1$, $\binom{n}{1} = n$, $\binom{n}{2} = \frac{n(n-1)}{2}$, $\binom{n}{3} = \frac{n(n-1)(n-2)}{6}$, and so forth.


The total number of points in a sphere of radius r is the sum
of the points in the shells from radius 0 to r:

$$\sum_{i=0}^{r} \binom{n}{i}.$$

There seems to be no simple formula for this sum [Knu1].
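

Lacking a closed formula, the sum is easy enough to compute term by
term. The following sketch uses 64-bit arithmetic and is intended
only for modest n (say, n below 60); the function names are invented
for the example.

```c
/* Binomial coefficient "n choose r", computed without factorials.
   After step i the running value is C(n - r + i, i), so every
   division is exact. */
unsigned long long choose(unsigned n, unsigned r) {
   unsigned long long c = 1;
   if (r > n) return 0;
   for (unsigned i = 1; i <= r; i++)
      c = c * (n - r + i) / i;
   return c;
}

/* Number of words of length n within Hamming distance r of a
   given word: the sum of the shell sizes for radius 0 to r. */
unsigned long long sphere_points(unsigned n, unsigned r) {
   unsigned long long sum = 0;
   for (unsigned i = 0; i <= r; i++)
      sum += choose(n, i);
   return sum;
}
```

For example, a sphere of radius 1 in words of length 7 holds
1 + 7 = 8 points, and a sphere of radius 3 in words of length 23
holds 1 + 23 + 253 + 1771 = 2048 points.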


From this it is easy to obtain bounds on A(n, d). First, assume 
you have a code of length n and minimum distance d, and it 
consists of M code words. Surround each code word with a 
sphere, all of the same maximal radius such that no two spheres 
have a point in common. This radius is (d − 1)/2 if d is odd, and
is (d − 2)/2 if d is even (see Figure 15-3). Because each point is
in at most one sphere, the total number of points in the M
spheres must be less than or equal to the total number of points
in the space. That is,

$$M \sum_{i=0}^{\lfloor (d-1)/2 \rfloor} \binom{n}{i} \le 2^n.$$

This holds for any M, hence for M = A(n, d), so that

$$A(n, d) \le \frac{2^n}{\sum_{i=0}^{\lfloor (d-1)/2 \rfloor} \binom{n}{i}}.$$

This is known as the sphere-packing bound, or the Hamming
bound.


Each large dot represents a code word, and each small dot represents a non-code 
word a unit distance away from its neighbors. 


FIGURE 15-3. Maximum radius that allows correcting points 
within a sphere. 


The sphere idea also easily gives a lower bound on A(n, d). 
Assume again that you have a code of length n and minimum 
distance d, and it has the maximum possible number of code 
words—that is, it has A(n, d) code words. Surround each code 
word with a sphere of radius d − 1. Then these spheres must
cover all 2^n points in the space (possibly overlapping). For if not,
there would be a point that is at a distance d or more from all
code words, and that is impossible because such a point could
be added to the code as another code word, contradicting the
assumption that the code is of maximum size. Thus, we have a
weak form of the Gilbert-Varshamov bound:

$$A(n, d) \sum_{i=0}^{d-1} \binom{n}{i} \ge 2^n.$$
There is also a strong form of the G-V bound, which applies to
linear codes. Its derivation relies on methods of linear algebra
which, important as they are to the subject of linear codes, are
not covered in this short introduction to error-correcting codes.
Suffice it to say that a linear code is one in which the sum 
(exclusive or) of any two code words is also a code word. The 
Hamming code of Table 15-1 is a linear code. Because the G-V 
bound is a lower bound on linear codes, it is also a lower bound 
on the unrestricted codes considered here. For large n, it is the 
best known lower bound on both linear and unrestricted codes. 


The strong G-V bound states that A(n, d) ≥ 2^m, where m is
the largest integer such that

$$2^m < \frac{2^n}{\sum_{i=0}^{d-2} \binom{n-1}{i}}.$$

That is, it is the value of the right-hand side of this inequality
rounded down to the next strictly smaller integral power of 2.
The “strictness” is important for cases such as (n, d) = (8, 3),
(16, 3) and (the degenerate case) (6, 7).


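Combining these results:

$$\mathrm{GP2LT}\left(\frac{2^n}{\sum_{i=0}^{d-2} \binom{n-1}{i}}\right) \le A(n, d) \le \frac{2^n}{\sum_{i=0}^{\lfloor (d-1)/2 \rfloor} \binom{n}{i}}, \qquad (6)$$

where GP2LT denotes the greatest integral power of 2 (strictly)
less than its argument.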


Table 15-7 gives the values of these bounds for some small 
values of n and d. A single number in an entry means the lower 
and upper bounds given by (6) are equal. 
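

Both bounds are simple to evaluate by program. The sketch below
assumes the usual statement of the strong G-V bound, in which the sum
in the lower bound runs over C(n − 1, i) for i from 0 to d − 2, and
it requires 2 ≤ d ≤ n with n small enough for 64-bit arithmetic. The
function names are made up for this example, and the upper bound is
rounded down to an integer.

```c
/* Binomial coefficient "n choose r"; each division is exact. */
unsigned long long choose(unsigned n, unsigned r) {
   unsigned long long c = 1;
   if (r > n) return 0;
   for (unsigned i = 1; i <= r; i++)
      c = c * (n - r + i) / i;
   return c;
}

/* Hamming (sphere-packing) upper bound on A(n, d), rounded down. */
unsigned long long hamming_upper(unsigned n, unsigned d) {
   unsigned long long s = 0;
   for (unsigned i = 0; i <= (d - 1) / 2; i++)
      s += choose(n, i);
   return (1ULL << n) / s;
}

/* Strong G-V lower bound: the greatest power of 2 strictly less
   than 2^n / s, where s is the sum of C(n-1, i), i = 0 to d-2. */
unsigned long long gv_lower(unsigned n, unsigned d) {
   unsigned long long s = 0, pw = 1;
   for (unsigned i = 0; i + 2 <= d; i++)
      s += choose(n - 1, i);
   while (2 * pw * s < (1ULL << n))   /* Is 2*pw still < 2^n / s? */
      pw *= 2;
   return pw;
}
```

For (n, d) = (18, 3) these give 8192 and 13797, and for (15, 3) both
give 2048. The comparison is done by cross-multiplying, which avoids
any rounding question in the "strictly less than" test.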


TABLE 15-7. THE G - V AND HAMMING BOUNDS ON A(n, d) 


Toe (455 Tes espe 
6 4-5 5 Е 1 — [ls 


315 
2048 
8192 — 
13797 
65536 — 
95325 
219 _ 
671088 


222 


If d is even, bounds can be computed directly from (6) or,
making use of Equation (5), they can be computed from (6) with 
d replaced with d - 1 and n replaced with n – 1 in the two 
bounds expressions. It turns out that the latter method always 
results in tighter or equal bounds. Therefore, the entries in Table 
15-7 were calculated only for odd d. To access the table for even 
d, use the values of d shown in the heading and the values of n 
shown at the left. 


The bounds given by (6) can be seen to be rather loose, 
especially for large d. The ratio of the upper bound to the lower 
bound diverges to infinity with increasing n. The lower bound is 
particularly loose. Over a thousand papers have been written 
describing methods to improve these bounds, and the results as 
of this writing are shown in Table 15-8 [Agrell, Brou; where
they differ, Table 15-8 shows the tighter bounds].


TABLE 15-8. BEST KNOWN BOUNDS ON A(n, d) 


[Table entries not recoverable from the source text.] 


The cases of (n, d) = (7, 3), (15, 3), and (23, 7) are perfect 
codes, meaning that they achieve the upper bound given by (6). 
This definition is a generalization of that given on page 333. The 
codes for which n is odd and n = d are also perfect; see exercise 
8. 


We conclude this chapter by pointing out that the idea of 
minimum distance over an entire code, which leads to the ideas 
of p-bit error detection and q-bit error correction for some p and 
q, is not the only criterion for the “power” of a binary FEC block 
code. For example, work has been done on codes aimed at 
correcting burst errors. [Etzion] has demonstrated a (16, 11) 
code, and others, that can correct any single-bit error and any 
error in two consecutive bits, and is perfect, in a sense not 
discussed here. It is not capable of general double-bit error 
detection. The (16, 11) extended Hamming code is SEC-DED and 
is perfect. Thus, his code gives up general double-bit error 
detection in return for double-bit error correction of consecutive 
bits. This is, of course, interesting because in many applications 
errors are likely to occur in short bursts. 


Exercises 


1. Show a Hamming code for m = 3 (make a table similar to 
Table 15-1). 


2. In a certain application of an SEC code, there is no need to 
correct the check bits. Hence the k check bits need only 
check the information bits, but not themselves. For m 
information bits, k must be large enough so that the 
receiver can distinguish m + 1 cases: which of the m bits 
is in error, or no error occurred. Thus, the number of 
check bits required is given by 2^k ≥ m + 1. This is a 
weaker restriction on k than is the Hamming rule, so it 
should be possible to construct, for some values of m, an 
SEC code that has fewer check bits than those required by 
the Hamming rule. Alternatively, one could have just one 
value to signify that an error occurred somewhere in the 
check bits, without specifying where. This would lead to 
the rule 2^k ≥ m + 2. 


What is wrong with this reasoning? 


3. (Brain teaser) Given m, how would you find the least k 
that satisfies inequality (1)? 


4. Show that the Hamming distance function for any binary 
block code satisfies the triangle inequality: if x, y, and z are 
code vectors and d(x, y) denotes the Hamming distance 
between x and y, then 


d(x, z) ≤ d(x, y) + d(y, z). 


5. Prove: A(2n, 2d) ≥ A(n, d). 
6. Prove the “singleton bound”: A(n, d) ≤ 2^(n - d + 1). 


7. Show that the notion of a perfect code as equality in the 
right-hand portion of inequality (6) is a generalization of 
the Hamming rule. 


8. What is the value of A(n, d) if n = d? Show that for odd n, 
these codes are perfect. 


9. Show that if n is a multiple of 3 and d = 2n/3, then 
A(n, d) = 4. 


10. Show that if d > 2n/3, then A(n, d) = 2. 


11. A two-dimensional parity check scheme for 64 
information bits arranges the information bits u0 ... u63 
into an 8 × 8 array, and appends a parity bit to each row 
and column as shown below. 


u0    u1    ...   u7     r0 
u8    u9    ...   u15    r1 
...
u56   u57   ...   u63    r7 
c0    c1    ...   c7     r8 


The r_i are parity check bits on the rows, and the c_i are parity 
check bits on the columns. The “corner” check bit could be a 
parity check on the row or the column of check bits (but not 
both); it is shown as a check on the bottom row (check bits c0 
through c7). 


Comment on this scheme. In particular, is it SEC-DED? Is its 
error-detection and -correction capability significantly altered if 
the corner bit r8 is omitted? Is there any simple relation between 
the value of the corner bit if it's a row sum or a column sum? 


Chapter 16. Hilbert's Curve 


In 1890, Giuseppe Peano discovered a planar curve with the 
rather surprising property that it is “space-filling.” The curve 
winds around the unit square and hits every point (x, y) at least 
once. 


Peano's curve is based on dividing each side of the unit 
square into three equal parts, which divides the square into nine 
smaller squares. His curve traverses these nine squares in a 
certain order. Then, each of the nine small squares is similarly 
divided into nine still smaller squares, and the curve is modified 
to traverse all these squares in a certain order. The curve can be 
described using fractions expressed in base 3; in fact, that's the 
way Peano first described it. 


In 1891, David Hilbert [Hil] discovered a variation of Peano's 
curve based on dividing each side of the unit square into two 
equal parts, which divides the square into four smaller squares. 
Then, each of the four small squares is similarly divided into 
four still smaller squares, and so on. For each stage of this 
division, Hilbert gives a curve that traverses all the squares. 
Hilbert's curve, sometimes called the “Peano-Hilbert curve,” is 
the limit curve of this division process. It can be described using 
fractions expressed in base 2. 


Figure 16-1 shows the first three steps in the sequence that 
leads to Hilbert's space-filling curve, as they were depicted in his 
1891 paper. 




FIGURE 16-1. First three curves in the sequence defining 
Hilbert's curve. 


Here, we do things a little differently. We use the term 
“Hilbert curve” for any of the curves in the sequence whose 
limit is the Hilbert space-filling curve. The “Hilbert curve of 
order n” means the nth curve in the sequence. In Figure 16-1, 
the curves are of order 1, 2, and 3. We shift the curves down and 
to the left so that the corners of the curves coincide with the 
intersections of the lines in the boxes above. Finally, we scale 
the size of the order n curve up by a factor of 2^n, so that the 
coordinates of the corners of the curves are integers. Thus, our 
order n Hilbert curve has corners at integers ranging from 0 to 
2^n - 1 in both x and y. We take the positive direction along the 
curve to be from (x, y) = (0, 0) to (2^n - 1, 0). Figure 16-2 shows 
the Hilbert curves of orders 1 through 6. 


16-1 A Recursive Algorithm for Generating the 
Hilbert Curve 


To see how to generate a Hilbert curve, examine the curves in 
Figure 16-2. The order 1 curve goes up, right, and down. The 
order 2 curve follows this overall pattern. First, it makes a U- 
shaped curve that goes up, in net effect. Second, it takes a unit 
step up. Third, it takes a U-shaped curve, a step, and another U, 
all to the right. Finally, it takes a step down, followed by a U 
that goes down, in net effect. 


FIGURE 16-2. Hilbert curves of orders 1-6. 


The order 1 inverted U is converted into the order 2 Y-shaped 
curve. 

We can regard the Hilbert curve of any order as a series of U- 
shaped curves of various orientations, each of which, except for 
the last, is followed by a unit step in a certain direction. In 


transforming a Hilbert curve of one order to the next, each U- 
shaped curve is transformed into a Y-shaped curve with the 
same general orientation, and each unit step is transformed to a 
unit step in the same direction. 


The transformation of the order 1 Hilbert curve (a U curve 
with a net direction to the right and a clockwise rotational 
orientation) to the order 2 Hilbert curve goes as follows: 


1. Draw a U that goes up and has a counterclockwise 
rotation. 


2. Draw a step up. 


3. Draw a U that goes to the right and has a clockwise 
rotation. 


4. Draw a step to the right. 


5. Draw a U that goes to the right and has a clockwise 
rotation. 


6. Draw a step down. 


7. Draw a U that goes down and has a counterclockwise 
rotation. 


We can see by inspection that all U's that are oriented as the 
order 1 Hilbert curve are transformed in the same way. A similar 
set of rules can be made for transforming U's with other 
orientations. These rules are embodied in the recursive program 
shown in Figure 16-3 [Voor]. In this program, the orientation of 
a U curve is characterized by two integers that specify the net 
linear and the rotational directions, encoded as follows: 


dir = 0: right        rot = +1: clockwise 
dir = 1: up           rot = -1: counterclockwise 
dir = 2: left 
dir = 3: down 


Actually, dir can take on other values, but its congruency 
modulo 4 is what matters. 


void step(int); 

void hilbert(int dir, int rot, int order) { 

   if (order == 0) return; 

   dir = dir + rot; 
   hilbert(dir, -rot, order - 1); 
   step(dir); 
   dir = dir - rot; 
   hilbert(dir, rot, order - 1); 
   step(dir); 
   hilbert(dir, rot, order - 1); 
   dir = dir - rot; 
   step(dir); 
   hilbert(dir, -rot, order - 1); 
} 

FIGURE 16-3. Hilbert curve generator. 
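
To see the recursion at work without the printing driver, the generator can be paired with a step function that merely records each move. The following is a sketch under stated assumptions: the recursive function is the one from Figure 16-3, while the recording step, the letters R/U/L/D, and the wrapper hilbert_moves are illustrative additions. Like the driver in the text, it relies on dir & 3 giving the residue modulo 4 for negative dir (two's-complement integers).

```c
#include <stddef.h>

static char  *out;      /* buffer supplied by the caller (assumed large enough) */
static size_t count;    /* number of moves recorded so far */

/* Record one unit step: 0 = right, 1 = up, 2 = left, 3 = down (mod 4). */
static void step(int dir) {
    out[count++] = "RULD"[dir & 3];
}

static void hilbert(int dir, int rot, int order) {
    if (order == 0) return;
    dir = dir + rot;
    hilbert(dir, -rot, order - 1);
    step(dir);
    dir = dir - rot;
    hilbert(dir, rot, order - 1);
    step(dir);
    hilbert(dir, rot, order - 1);
    dir = dir - rot;
    step(dir);
    hilbert(dir, -rot, order - 1);
}

/* Writes the 4^order - 1 moves of the curve into buf as a NUL-terminated
   string and returns their number; buf must hold at least 4^order bytes. */
size_t hilbert_moves(int order, char *buf) {
    out = buf;
    count = 0;
    hilbert(0, 1, order);
    buf[count] = '\0';
    return count;
}
```

For order 1 the recorded string is "URD" (up, right, down), matching the description of the order 1 curve given earlier.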


Figure 16-4 shows a driver program and function step that 
is used by program hilbert. This program is given the order of 
a Hilbert curve to construct, and it displays a list of line 
segments, giving for each the direction of movement, the length 
along the curve to the end of the segment, and the coordinates 
of the end of the segment. For example, for order 2 it displays 


    0   0000   00   00 
    0   0001   01   00 
    1   0010   01   01 
    2   0011   00   01 
    1   0100   00   10 
    1   0101   00   11 
    0   0110   01   11 
   -1   0111   01   10 
    0   1000   10   10 
    1   1001   10   11 
    0   1010   11   11 
   -1   1011   11   10 
   -1   1100   11   01 
   -2   1101   10   01 
   -1   1110   10   00 
    0   1111   11   00 


#include <stdio.h> 
#include <stdlib.h> 

int x = -1, y = 0;             // Global variables. 
int s = 0;                     // Dist. along curve. 
int blen;                      // Length to print. 

void hilbert(int dir, int rot, int order); 

void binary(unsigned k, int len, char *s) { 
/* Converts the unsigned integer k to binary character 
form. Result is string s of length len. */ 

   int i; 

   s[len] = 0; 
   for (i = len - 1; i >= 0; i--) { 
      if (k & 1) s[i] = '1'; 
      else       s[i] = '0'; 
      k = k >> 1; 
   } 
} 

void step(int dir) { 
   char ii[33], xx[17], yy[17]; 

   switch(dir & 3) { 
      case 0: x = x + 1; break; 
      case 1: y = y + 1; break; 
      case 2: x = x - 1; break; 
      case 3: y = y - 1; break; 
   } 
   binary(s, 2*blen, ii); 
   binary(x, blen, xx); 
   binary(y, blen, yy); 
   printf("%5d   %s   %s   %s\n", dir, ii, xx, yy); 
   s = s + 1;                  // Increment distance. 
} 

int main(int argc, char *argv[]) { 
   int order; 

   order = atoi(argv[1]); 
   blen = order; 
   step(0);                    // Print init. point. 
   hilbert(0, 1, order); 
   return 0; 
} 

FIGURE 16-4. Driver program for Hilbert curve generator. 


16-2 Coordinates from Distance along the Hilbert 
Curve 


To find the (x, y) coordinates of a point located at a distance s 
along the order n Hilbert curve, observe that the most significant 
two bits of the 2n-bit integer s 


determine which major quadrant the point is in. This is because 
the Hilbert curve of any order follows the overall pattern of the 
order 1 curve. If the most significant two bits of s are 00, the 
point is somewhere in the lower-left quadrant, if 01 it is in the 
upper-left quadrant, if 10 it is in the upper-right quadrant, and if 
11 it is in the lower-right quadrant. Thus, the most significant 
two bits of s determine the most significant bits of the n-bit 
integers x and y, as follows: 


Most significant     Most significant 
two bits of s        bits of (x, y) 

      00                 (0, 0) 
      01                 (0, 1) 
      10                 (1, 1) 
      11                 (1, 0) 


In any Hilbert curve, only four of the eight possible U-shapes 
occur. These are shown in Table 16-1 as graphics and as maps 
from two bits of s to a single bit of each of x and y. 


TABLE 16-1. THE FOUR POSSIBLE MAPPINGS 


      A               B               C               D 

00 → (0, 0)     00 → (0, 0)     00 → (1, 1)     00 → (1, 1) 
01 → (0, 1)     01 → (1, 0)     01 → (1, 0)     01 → (0, 1) 
10 → (1, 1)     10 → (1, 1)     10 → (0, 0)     10 → (0, 0) 
11 → (1, 0)     11 → (0, 1)     11 → (0, 1)     11 → (1, 0) 


Observe from Figure 16-2 that in all cases the U-shape 
represented by map A becomes, at the next level of detail, 
a U-shape represented by maps B, A, A, or D, depending on 
whether the length traversed in the first-mentioned map A is 0, 
1, 2, or 3, respectively. Similarly, a U-shape represented by map 
B becomes, at the next level of detail, a U-shape 
represented by maps A, B, B, or C, depending on whether the 
length traversed in the first-mentioned map B is 0, 1, 2, or 3, 
respectively. 


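TABLE 16-2. STATE TRANSITION TABLE FOR COMPUTING (X, Y) 
FROM S 


If the current   and the next (to right)   then append   and enter 
state is         two bits of s are         to (x, y)     state 

A                00                        (0, 0)        B 
A                01                        (0, 1)        A 
A                10                        (1, 1)        A 
A                11                        (1, 0)        D 
B                00                        (0, 0)        A 
B                01                        (1, 0)        B 
B                10                        (1, 1)        B 
B                11                        (0, 1)        C 
C                00                        (1, 1)        D 
C                01                        (1, 0)        C 
C                10                        (0, 0)        C 
C                11                        (0, 1)        B 
D                00                        (1, 1)        C 
D                01                        (0, 1)        D 
D                10                        (0, 0)        D 
D                11                        (1, 0)        A 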

These observations lead to the state transition table shown in 
Table 16-2, in which the states correspond to the mappings 
shown in Table 16-1. 


To use the table, start in state A. The integer s should be 
padded with leading zeros so that its length is 2n, where n is the 
order of the Hilbert curve. Scan the bits of s in pairs from left to 
right. The first row of Table 16-2 means that if the current state 
is A and the currently scanned bits of s are 00, then output (0, 0) 
and enter state B. Then, advance to the next two bits of s. 
Similarly, the second row means that if the current state is A and 
the scanned bits are 01, then output (0, 1) and stay in state A. 


The output bits are accumulated in left-to-right order. When 


the end of s is reached, the n-bit output quantities x and y are 
defined. 


As an example, suppose n = 3 and 
s = 110100. 


Because the process starts in state A and the initial bits scanned 
are 11, the process outputs (1, 0) and enters state D (fourth 
row). Then, in state D and scanning 01, the process outputs (0, 
1) and stays in state D. Lastly, the process outputs (1, 1) and 
enters state C, although the state is now immaterial. 


Thus, the output is (101, 011); that is, x = 5 and y = 3. 


A C program implementing these steps is shown in Figure 16- 
5. In this program, the current state is represented by an integer 
from 0 to 3 for states A through D, respectively. In the 
assignment to variable row, the current state is concatenated 
with the next two bits of s, giving an integer from 0 to 15, 
which is the applicable row number in Table 16-2. Variable row 
is used to access integers (expressed in hexadecimal) that are 
used as bit strings to represent the rightmost two columns of 
Table 16-2; that is, these accesses are in-register table lookups. 
Left-to-right in the hexadecimal values corresponds to bottom- 
to-top in Table 16-2. 


void hil_xy_from_s(unsigned s, int n, unsigned *xp, 
                                      unsigned *yp) { 
   int i; 
   unsigned state, x, y, row; 

   state = 0;                            // Initialize. 
   x = y = 0; 

   for (i = 2*n - 2; i >= 0; i -= 2) {   // Do n times. 
      row = 4*state | (s >> i) & 3;      // Row in table. 
      x = (x << 1) | (0x936C >> row) & 1; 
      y = (y << 1) | (0x39C6 >> row) & 1; 
      state = (0x3E6B94C1 >> 2*row) & 3; // New state. 
   } 
   *xp = x;                              // Pass back 
   *yp = y;                              // results. 
} 


FIGURE 16-5. Program for computing (x, y) from s. 
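
For readers who find the packed hexadecimal constants hard to follow, the same state machine can be written with explicit arrays. This is only an illustrative sketch: the array names xbit, ybit, and next_state and the function name hil_xy_from_s_table are not from the text, and the array contents are simply the constants of Figure 16-5 unpacked row by row (row = 4*state + two bits of s, with states A through D numbered 0 through 3).

```c
/* Columns of the state transition table, indexed by row 0..15. */
static const unsigned char xbit[16] = {
    0, 0, 1, 1,   /* state A */
    0, 1, 1, 0,   /* state B */
    1, 1, 0, 0,   /* state C */
    1, 0, 0, 1    /* state D */
};
static const unsigned char ybit[16] = {
    0, 1, 1, 0,
    0, 0, 1, 1,
    1, 0, 0, 1,
    1, 1, 0, 0
};
static const unsigned char next_state[16] = {
    1, 0, 0, 3,   /* A -> B, A, A, D */
    0, 1, 1, 2,   /* B -> A, B, B, C */
    3, 2, 2, 1,   /* C -> D, C, C, B */
    2, 3, 3, 0    /* D -> C, D, D, A */
};

void hil_xy_from_s_table(unsigned s, int n, unsigned *xp, unsigned *yp) {
    unsigned state = 0, x = 0, y = 0;
    for (int i = 2*n - 2; i >= 0; i -= 2) {      /* scan s two bits at a time */
        unsigned row = 4*state | ((s >> i) & 3);
        x = (x << 1) | xbit[row];                /* append one bit to x */
        y = (y << 1) | ybit[row];                /* append one bit to y */
        state = next_state[row];
    }
    *xp = x;
    *yp = y;
}
```

With n = 3 and s = 110100 (binary) this yields x = 5 and y = 3, as in the worked example above.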


[L&S] give a quite different algorithm. Unlike the algorithm 
of Figure 16-5, it scans the bits of s from right to left. It is based 
on the observation that one can map the least significant two 
bits of s to (x, y) based on the order 1 Hilbert curve, and then 
test the next two bits of s to the left. If they are 00, the values of 
x and y just computed should be interchanged, which 
corresponds to reflecting the order 1 Hilbert curve about the line 
x = y. (Refer to the curves of orders 1 and 2 shown in Figure 
16-1 on page 355.) If these two bits are 01 or 10, the values of x 
and y are not changed. If they are 11, the values of x and y are 
interchanged and complemented. These same rules apply as one 
progresses leftward along the bits of s. They are embodied in 
Table 16-3 and the code of Figure 16-6. It is somewhat curious 
that the bits can be prepended to x and y first, and then the 
swap and complement operations can be done, including these 
newly prepended bits; the results are the same. 


TABLE 16-3. LAM AND SHAPIRO METHOD FOR COMPUTING (X, Y) 
FROM S 


If the next (to left)                                    and prepend to 
two bits of s are       then                             (x, y) 

00                      Swap x and y                     (0, 0) 
01                      No change                        (0, 1) 
10                      No change                        (1, 1) 
11                      Swap and complement x and y      (1, 0) 

void hil_xy_from_s(unsigned s, int n, unsigned *xp, 
                                      unsigned *yp) { 

   int i, sa, sb; 
   unsigned x, y, temp; 

   for (i = 0; i < 2*n; i += 2) { 

      sa = (s >> (i+1)) & 1;      // Get bit i+1 of s. 
      sb = (s >> i) & 1;          // Get bit i of s. 

      if ((sa ^ sb) == 0) {       // If sa,sb = 00 or 11, 
         temp = x;                // swap x and y, 
         x = y ^ (-sa);           // and if sa = 1, 
         y = temp ^ (-sa);        // complement them. 
      } 
      x = (x >> 1) | (sa << 31);  // Prepend sa to x and 
      y = (y >> 1) | ((sa ^ sb) << 31); // (sa ^ sb) to y. 
   } 
   *xp = x >> (32 - n);           // Right-adjust x and y 
   *yp = y >> (32 - n);           // and return them to 
}                                 // the caller. 


FIGURE 16-6. Lam and Shapiro method for computing (x, y) 
from s. 


In Figure 16-6, variables x and y are uninitialized, which 
might cause an error message from some compilers, but the code 
functions correctly for whatever values x and y have initially. 


The branch in the loop of Figure 16-6 can be avoided by 
doing the swap operation with the “three exclusive or" trick 
given in Section 2-20 on page 45. The if block can be replaced 
by the following code, where swap and cmpl are unsigned 
integers: 


swap = (sa ^ sb) - 1;  // -1 if should swap, else 0. 
cmpl = -(sa & sb);     // -1 if should compl't, else 0. 
x = x ^ y; 
y = y ^ (x & swap) ^ cmpl; 
x = x ^ y; 


This is nine instructions, versus about two or six for the if block, 
so the branch cost would have to be quite high for this to be a 
good choice. 
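
To check the branch-free sequence in isolation, it can be wrapped in a small function. This is a sketch only; the function name swap_cmpl and its pointer interface are illustrative, and sa and sb are assumed to be 0 or 1.

```c
/* If sa == sb, swap *x and *y; if in addition sa == sb == 1, also
   complement both.  Otherwise leave them unchanged.  No branches. */
void swap_cmpl(unsigned *x, unsigned *y, unsigned sa, unsigned sb) {
    unsigned swap = (sa ^ sb) - 1;      /* all 1's if should swap, else 0 */
    unsigned cmpl = 0u - (sa & sb);     /* all 1's if should complement, else 0 */
    unsigned a = *x, b = *y;

    a = a ^ b;
    b = b ^ (a & swap) ^ cmpl;
    a = a ^ b;

    *x = a;
    *y = b;
}
```

When swap is 0, the middle statement leaves b unchanged (cmpl is necessarily 0 in that case), so the first and third statements cancel and nothing moves.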


The “swap and complement” idea of [L&S] suggests a logic 
circuit for generating the Hilbert curve. The idea behind the 
circuit, described below, is that as you trace along the path of an 
order n curve, you basically map pairs of bits of s to (x, y) 
according to map A of Table 16-1. As the trace enters various 
regions, the mapping output gets swapped, complemented, or 
both. The circuit of Figure 16-7 keeps track of the swap and 
complement requirements of each stage, uses the appropriate 
mapping to map two bits of s to (x_i, y_i), and generates the swap 
and complement signals for the next stage. 


FIGURE 16-7. Logic circuit for computing (x, y) from s. 


Assume there is a register containing the path length s and 
circuits for incrementing it. Then, to find the next point on the 
Hilbert curve, first increment s and then transform it as 
described in Table 16-4. This is a left-to-right process, which is a 
bit of a problem because incrementing s is a right-to-left process. 
Thus, the time to generate a new point on an order n Hilbert 
curve is proportional to 2n (for incrementing s) plus n (for 
transforming s to (x, y)). 


TABLE 16-4. LOGIC FOR COMPUTING (X, Y) FROM S 


If the next (to right) 
two bits of s are        then append to (x, y)    and set 

00                       (0, 0)*                  swap = NOT swap 
01                       (0, 1)*                  No change 
10                       (1, 1)*                  No change 
11                       (1, 0)*                  swap = NOT swap, cmpl = NOT cmpl 


* Possibly swapped and/or complemented 


Figure 16-7 shows this computation as a logic circuit. In this 
figure, S denotes the swap signal and C denotes the complement 
signal. 
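
The left-to-right process just described can also be written directly in software, keeping the swap and complement signals as two one-bit variables. The following is a sketch under the assumption that each pair of bits of s is first mapped by map A and then swapped and/or complemented according to the signals accumulated from the pairs to its left; the function name hil_xy_from_s_flags is illustrative.

```c
void hil_xy_from_s_flags(unsigned s, int n, unsigned *xp, unsigned *yp) {
    unsigned swap = 0, cmpl = 0;        /* signals entering the leftmost stage */
    unsigned x = 0, y = 0;

    for (int i = 2*n - 2; i >= 0; i -= 2) {
        unsigned sa = (s >> (i + 1)) & 1;   /* left bit of the pair  */
        unsigned sb = (s >> i) & 1;         /* right bit of the pair */

        unsigned xi = sa;                   /* map A: 00->(0,0), 01->(0,1), */
        unsigned yi = sa ^ sb;              /*        10->(1,1), 11->(1,0)  */

        if (swap) { unsigned t = xi; xi = yi; yi = t; }
        xi ^= cmpl;
        yi ^= cmpl;

        x = (x << 1) | xi;                  /* append to x and y */
        y = (y << 1) | yi;

        if (sa == sb) swap ^= 1;            /* 00 or 11 toggles swap       */
        cmpl ^= sa & sb;                    /* 11 also toggles complement  */
    }
    *xp = x;
    *yp = y;
}
```

For n = 3 and s = 110100 (binary) this again gives (x, y) = (5, 3).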


The logic circuit of Figure 16-7 suggests another way to 
compute (x, y) from s. Notice how the swap and complement 
signals propagate from left to right through the n stages. This 
suggests that it might be possible to use the parallel prefix 
operation to quickly (in log2 n steps rather than n - 1) propagate 
the swap and complement information to each stage, and then 


do some word-parallel logical operations to compute x and y, 
using the equations in Figure 16-7. The values of x and y are 
intermingled in the even and odd bit positions of a word, so they 
have to be separated by the unshuffle operation (see page 140). 
This might seem a bit complicated, and likely to pay off only for 
rather large values of n, but let us see how it goes. 


A procedure for this operation is shown in Figure 16-8 
[GLS1]. The procedure operates on fullword quantities, so it first 
pads the input s on the left with ‘01’ bits. This bit combination 
does not affect the swap and complement quantities. Next, a 
quantity cs (complement-swap) is computed. This word is of the 
form cscs...cs, where each c (a single bit), if 1, means that the 
corresponding pair of bits is to be complemented, and each s 
means that the corresponding pair of bits is to be swapped, 
following Table 16-3. In other words, these two statements map 
each pair of bits of s as follows: 


s      cs 

00     01 
01     00 
10     00 
11     11 


void hil_xy_from_s(unsigned s, int n, unsigned *xp, 
                                      unsigned *yp) { 

   unsigned comp, swap, cs, t, sr; 

   s = s | (0x55555555 << 2*n);    // Pad s on left with 01 
   sr = (s >> 1) & 0x55555555;     // (no change) groups. 
   cs = ((s & 0x55555555) + sr)    // Compute complement & 
        ^ 0x55555555;              // swap info in two-bit 
                                   // groups. 

   // Parallel prefix xor op to propagate both complement 
   // and swap info together from left to right (there is 
   // no step "cs ^= cs >> 1", so in effect it computes 
   // two independent parallel prefix operations on two 
   // interleaved sets of sixteen bits). 

   cs = cs ^ (cs >> 2); 
   cs = cs ^ (cs >> 4); 
   cs = cs ^ (cs >> 8); 
   cs = cs ^ (cs >> 16); 
   swap = cs & 0x55555555;         // Separate the swap and 
   comp = (cs >> 1) & 0x55555555;  // complement bits. 

   t = (s & swap) ^ comp;          // Calculate x and y in 
   s = s ^ sr ^ t ^ (t << 1);      // the odd & even bit 
                                   // positions, resp. 
   s = s & ((1 << 2*n) - 1);       // Clear out any junk 
                                   // on the left (unpad). 

   // Now "unshuffle" to separate the x and y bits. 

   t = (s ^ (s >> 1)) & 0x22222222; s = s ^ t ^ (t << 1); 
   t = (s ^ (s >> 2)) & 0x0C0C0C0C; s = s ^ t ^ (t << 2); 
   t = (s ^ (s >> 4)) & 0x00F000F0; s = s ^ t ^ (t << 4); 
   t = (s ^ (s >> 8)) & 0x0000FF00; s = s ^ t ^ (t << 8); 

   *xp = s >> 16;                  // Assign the two halves 
   *yp = s & 0xFFFF;               // of t to x and y. 
} 


FIGURE 16-8. Parallel prefix method for computing (x, y) 
from s. 


This is the quantity to which we want to apply the parallel 
prefix operation. PP-XOR is the one to use, going from left to 
right, because successive 1-bits meaning to complement or to 
swap have the same logical properties as exclusive or: Two 
successive 1-bits cancel each other. 


Both signals (complement and swap) are propagated in the 
same PP-XOR operation, each working with every other bit of 
cs. 
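
The effect of the four shift-and-xor statements can be examined on their own. The sketch below isolates them in a function (the name prefix_xor_pairs is illustrative, and a 32-bit word is assumed): each result bit j is the exclusive or of input bits j, j + 2, j + 4, and so on up to the top of the word, so the even-numbered and odd-numbered bits form two independent left-to-right prefix operations.

```c
#include <stdint.h>

/* Left-to-right parallel prefix xor on two interleaved 16-bit sets:
   result bit j = in[j] ^ in[j+2] ^ in[j+4] ^ ... (indices below 32). */
uint32_t prefix_xor_pairs(uint32_t cs) {
    cs = cs ^ (cs >> 2);    /* each bit now covers 2 terms  */
    cs = cs ^ (cs >> 4);    /* 4 terms  */
    cs = cs ^ (cs >> 8);    /* 8 terms  */
    cs = cs ^ (cs >> 16);   /* 16 terms: the whole interleaved set */
    return cs;
}
```

A single 1-bit in an even position therefore turns on every even position at or to its right, and a second 1-bit further right in the same set turns them off again, which is the cancelling behavior described above.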


The next four assignment statements have the effect of 
translating each pair of bits of s into (x, y) values, with x being 
in the odd (leftmost) bit positions, and y being in the even bit 
positions. Although the logic may seem obscure, it is not difficult 
to verify that each pair of bits of s is transformed by the logic of 
the first two Boolean equations in Figure 16-7. (Suggestion: 
Consider separately how the even and odd bit positions are 
transformed, using the fact that t and sr are 0 in the odd 
positions.) 


The rest of the procedure is self-explanatory. It executes in 66 
basic RISC instructions (constant, branch-free), versus about 19n 
+ 10 (average) for the code of Figure 16-6 (based on compiled 
code; includes prologs and epilogs, which are essentially nil). 
Thus, the parallel prefix method is faster for n ≥ 3. 


16-3 Distance from Coordinates on the Hilbert 
Curve 


Given the coordinates of a point on the Hilbert curve, the 
distance from the origin to the point can be calculated by means 
of a state transition table similar to Table 16-2. Table 16-5 is 
such a table. 


TABLE 16-5. STATE TRANSITION TABLE FOR COMPUTING S FROM 
(X, Y) 


If the current   and the next (to right)    then append   and enter 
state is         two bits of (x, y) are     to s          state 

A                (0, 0)                     00            B 
A                (0, 1)                     01            A 
A                (1, 0)                     11            D 
A                (1, 1)                     10            A 
B                (0, 0)                     00            A 
B                (0, 1)                     11            C 
B                (1, 0)                     01            B 
B                (1, 1)                     10            B 
C                (0, 0)                     10            C 
C                (0, 1)                     11            B 
C                (1, 0)                     01            C 
C                (1, 1)                     00            D 
D                (0, 0)                     10            D 
D                (0, 1)                     01            D 
D                (1, 0)                     11            A 
D                (1, 1)                     00            C 


Its interpretation is similar to that of the previous section. 
First, x and y should be padded with leading zeros so that they 
are of length n bits, where n is the order of the Hilbert curve. 
Second, the bits of x and y are scanned from left to right, and s is 
built up from left to right. 


A C program implementing these steps is shown in Figure 16- 
9. 


unsigned hil_s_from_xy(unsigned x, unsigned y, int n) { 

   int i; 
   unsigned state, s, row; 

   state = 0;                            // Initialize. 
   s = 0; 

   for (i = n - 1; i >= 0; i--) { 
      row = 4*state | 2*((x >> i) & 1) | (y >> i) & 1; 
      s = (s << 2) | (0x361E9CB4 >> 2*row) & 3; 
      state = (0x8FE65831 >> 2*row) & 3; 
   } 
   return s; 
} 


FIGURE 16-9. Program for computing s from (x, y). 
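
As with the other direction, the packed constants can be unpacked into explicit arrays, which makes the correspondence with the state transition table easier to check. This is a sketch only: the names sbits, next_state, and hil_s_from_xy_table are illustrative, and the array contents are the constants of Figure 16-9 unpacked two bits at a time (row = 4*state + 2*x_i + y_i, states A through D numbered 0 through 3).

```c
/* Two bits to append to s, indexed by row 0..15. */
static const unsigned char sbits[16] = {
    0, 1, 3, 2,   /* state A: (0,0) (0,1) (1,0) (1,1) */
    0, 3, 1, 2,   /* state B */
    2, 3, 1, 0,   /* state C */
    2, 1, 3, 0    /* state D */
};
static const unsigned char next_state[16] = {
    1, 0, 3, 0,   /* A -> B, A, D, A */
    0, 2, 1, 1,   /* B -> A, C, B, B */
    2, 1, 2, 3,   /* C -> C, B, C, D */
    3, 3, 0, 2    /* D -> D, D, A, C */
};

unsigned hil_s_from_xy_table(unsigned x, unsigned y, int n) {
    unsigned state = 0, s = 0;
    for (int i = n - 1; i >= 0; i--) {          /* scan x and y left to right */
        unsigned row = 4*state | 2*((x >> i) & 1) | ((y >> i) & 1);
        s = (s << 2) | sbits[row];              /* append two bits to s */
        state = next_state[row];
    }
    return s;
}
```

For example, with n = 3 the point (x, y) = (5, 3) maps back to s = 110100 (binary), the inverse of the example in the previous section.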


[L&S] give an algorithm for computing s from (x, y) that is 
similar to their algorithm for going in the other direction (Table 
16-3). It is a left-to-right algorithm, shown in Table 16-6 and 
Figure 16-10. 


TABLE 16-6. LAM AND SHAPIRO METHOD FOR COMPUTING S FROM 
(X, Y) 


If the next (to right)                                      and append 
two bits of (x, y) are    then                              to s 

(0, 0)                    Swap x and y                      00 
(0, 1)                    No change                         01 
(1, 0)                    Swap and complement x and y       11 
(1, 1)                    No change                         10 


unsigned hil_s_from_xy(unsigned x, unsigned y, int n) { 

   int i, xi, yi; 
   unsigned s, temp; 

   s = 0;                         // Initialize. 
   for (i = n - 1; i >= 0; i--) { 
      xi = (x >> i) & 1;          // Get bit i of x. 
      yi = (y >> i) & 1;          // Get bit i of y. 

      if (yi == 0) { 
         temp = x;                // Swap x and y and, 
         x = y ^ (-xi);           // if xi = 1, 
         y = temp ^ (-xi);        // complement them. 
      } 
      s = 4*s + 2*xi + (xi ^ yi); // Append two bits to s. 
   } 
   return s; 
} 


FIGURE 16-10. Lam and Shapiro method for computing s 
from (x, y). 


16-4 Incrementing the Coordinates on the Hilbert 
Curve 


Given the (x, y) coordinates of a point on the order n Hilbert 
curve, how can one find the coordinates of the next point? One 
way is to convert (x, y) to s, add 1 to s, and then convert the 
new value of s back to (x, y), using algorithms given above. 
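
The first approach can be sketched by composing the two table-driven conversions given earlier. In the block below, the two conversion routines repeat the constants of Figures 16-5 and 16-9; the wrapper name hil_next_via_s is illustrative. Because the conversion to (x, y) scans only the low-order 2n bits of s, stepping past the last point wraps around to (0, 0).

```c
static void hil_xy_from_s(unsigned s, int n, unsigned *xp, unsigned *yp) {
    unsigned state = 0, x = 0, y = 0, row;
    for (int i = 2*n - 2; i >= 0; i -= 2) {
        row = 4*state | ((s >> i) & 3);
        x = (x << 1) | ((0x936Cu >> row) & 1);
        y = (y << 1) | ((0x39C6u >> row) & 1);
        state = (0x3E6B94C1u >> 2*row) & 3;
    }
    *xp = x;
    *yp = y;
}

static unsigned hil_s_from_xy(unsigned x, unsigned y, int n) {
    unsigned state = 0, s = 0, row;
    for (int i = n - 1; i >= 0; i--) {
        row = 4*state | 2*((x >> i) & 1) | ((y >> i) & 1);
        s = (s << 2) | ((0x361E9CB4u >> 2*row) & 3);
        state = (0x8FE65831u >> 2*row) & 3;
    }
    return s;
}

/* Replace (*xp, *yp) by the next point on the order n curve:
   convert to s, add 1, and convert back. */
void hil_next_via_s(unsigned *xp, unsigned *yp, int n) {
    unsigned s = hil_s_from_xy(*xp, *yp, n);
    hil_xy_from_s(s + 1, n, xp, yp);
}
```

This costs two full scans of n steps each per point, which is the baseline the method described next tries to improve upon.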


A slightly (but not dramatically) better way is based on the 
fact that as one moves along the Hilbert curve, at each step 
either x or y, but not both, is either incremented or decremented 
(by 1). The algorithm to be described scans the coordinate 
numbers from left to right to determine the type of U-curve that 
the rightmost two bits are on. Then, based on the U-curve and 
the value of the rightmost two bits, it increments or decrements 
either x or y. 


That's basically it, but there is a complication when the path 
is at the end of a U-curve (which happens once every four steps). 
At this point, the direction to take is determined by the previous 
bits of x and y and by the higher order U-curve with which these 
bits are associated. If that point is also at the end of its U-curve, 
then the previous bits and the U-curve there determine the 
direction to take, and so on. 


Table 16-7 describes this algorithm. In this table, the A, B, C, 
and D denote the U-curves as shown in Table 16-1 on page 360. 
To use the table, first pad x and y with leading zeros so they are 
n bits long, where n is the order of the Hilbert curve. Start in 
state A and scan the bits of x and y from left to right. The first 


row of Table 16-7 means that if the current state is A and the 
currently scanned bits are (0, 0), then set a variable to indicate 
to increment y, and enter state B. The other rows are interpreted 
similarly, with a suffix minus sign indicating to decrement the 
associated coordinate. A dash in the third column means do not 
alter the variable that keeps track of the coordinate changes. 


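TABLE 16-7. TAKING ONE STEP ON THE HILBERT CURVE 


If the           and the next             then 
current state    (to right) two bits      prepare to    and enter 
is               of (x, y) are            inc/dec       state 

A                (0, 0)                   y+            B 
A                (0, 1)                   x+            A 
A                (1, 0)                   -             D 
A                (1, 1)                   y-            A 
B                (0, 0)                   x+            A 
B                (0, 1)                   -             C 
B                (1, 0)                   y+            B 
B                (1, 1)                   x-            B 
C                (0, 0)                   y+            C 
C                (0, 1)                   -             B 
C                (1, 0)                   x-            C 
C                (1, 1)                   y-            D 
D                (0, 0)                   x+            D 
D                (0, 1)                   y-            D 
D                (1, 0)                   -             A 
D                (1, 1)                   x-            C 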


After scanning the last (rightmost) bits of x and y, increment 
or decrement the appropriate coordinate as indicated by the 
final value of the variable. 


A C program implementing these steps is shown in Figure 16-11. 
Variable dx is initialized in such a way that if invoked many 
times, the algorithm cycles around, generating the same Hilbert 
curve over and over again. (However, the step that connects one 
cycle to the next is not a unit step.) 


void hil_inc_xy(unsigned *xp, unsigned *yp, int n) { 

   int i; 
   unsigned x, y, state, dx, dy, row, dochange; 

   x = *xp; 
   y = *yp; 
   state = 0;                    // Initialize. 
   dx = -((1 << n) - 1);         // Init. -(2**n - 1). 
   dy = 0; 

   for (i = n-1; i >= 0; i--) {          // Do n times. 
      row = 4*state | 2*((x >> i) & 1) | (y >> i) & 1; 
      dochange = (0xBDDB >> row) & 1; 
      if (dochange) { 
         dx = ((0x16451659 >> 2*row) & 3) - 1; 
         dy = ((0x51166516 >> 2*row) & 3) - 1; 
      } 
      state = (0x8FE65831 >> 2*row) & 3; 
   } 
   *xp = *xp + dx; 
   *yp = *yp + dy; 
} 


FIGURE 16-11. Program for taking one step on the Hilbert 
curve. 
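
One way to exercise the stepping routine is to walk the whole curve with it and confirm that every move is a unit step. The block below is a sketch: hil_inc_xy repeats the routine of Figure 16-11, while the checking function hil_walk is an illustrative addition that assumes n is small enough for 4^n to fit in an unsigned long.

```c
void hil_inc_xy(unsigned *xp, unsigned *yp, int n) {
    unsigned x = *xp, y = *yp;
    unsigned state = 0, row, dochange;
    unsigned dx = 0u - ((1u << n) - 1);   /* used only when wrapping around */
    unsigned dy = 0;

    for (int i = n - 1; i >= 0; i--) {
        row = 4*state | 2*((x >> i) & 1) | ((y >> i) & 1);
        dochange = (0xBDDBu >> row) & 1;
        if (dochange) {
            dx = ((0x16451659u >> 2*row) & 3) - 1;
            dy = ((0x51166516u >> 2*row) & 3) - 1;
        }
        state = (0x8FE65831u >> 2*row) & 3;
    }
    *xp = x + dx;
    *yp = y + dy;
}

/* Start at (0, 0) and take 4^n - 1 steps along the order n curve.
   Stores the final point and returns 1 if every step changed exactly
   one coordinate by exactly 1, else 0. */
int hil_walk(int n, unsigned *xp, unsigned *yp) {
    unsigned x = 0, y = 0;
    unsigned long steps = (1ul << 2*n) - 1;
    int ok = 1;

    for (unsigned long k = 0; k < steps; k++) {
        unsigned px = x, py = y;
        hil_inc_xy(&x, &y, n);
        int ddx = (int)(x - px), ddy = (int)(y - py);
        if (ddx*ddx + ddy*ddy != 1) ok = 0;   /* not a unit step */
    }
    *xp = x;
    *yp = y;
    return ok;
}
```

The walk should end at (2^n - 1, 0), the end point of the curve as defined at the start of the chapter; one further call returns to (0, 0), which is the non-unit step mentioned above.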


Table 16-7 can readily be implemented in logic, as shown in 
Figure 16-12. In this figure, the variables have the following 
meanings: 


x_i: Bit i of input x 
y_i: Bit i of input y 
X, Y: x_i and y_i swapped and complemented, according to S_{i+1} and C_{i+1} 
I: If 1, increment; if 0, decrement (by 1) 
W: If 1, increment or decrement x; if 0, increment or decrement y 
S: If 1, swap x_i and y_i 
C: If 1, complement x_i and y_i 


S and C together identify the “state” of Table 16-7, with (C, S) 
= (0,0), (0,1), (1,0), and (1,1) denoting states A, B, C, and D, 
respectively. The output signals are I_0 and W_0, which tell, 
respectively, whether to increment or decrement, and which 
variable to change. (In addition to the logic shown, an 
incrementer/decrementer circuit is required, with MUX's to 
route either x or y to the incrementer/decrementer, and a circuit 
to route the altered value back to the register that holds x or y. 
Alternatively, two incrementer/decrementer circuits could be 
used.) 


FIGURE 16-12. Logic circuit for incrementing (x, y) by one 
step along the Hilbert curve. 


16-5 Non-Recursive Generating Algorithms 


The algorithms of Tables 16-2 and 16-7 provide two non- 
recursive algorithms for generating the Hilbert curve of any 
order. Either algorithm can be implemented in hardware 
without great difficulty. Hardware based on Table 16-2 includes 
a register holding s, which it increments for each step, and then 
converts to (x, y) coordinates. Hardware based on Table 16-7 
would not have to include a register for s, but the algorithm is 
more complicated. 


16-6 Other Space-Filling Curves 


As was mentioned, Peano was first, in 1890, to discover a space- 
filling curve. The many variations discovered since then are 
often called “Peano curves." One interesting variation of 
Hilbert’s curve was discovered by Eliakim Hastings Moore in 
1900. It is “cyclic” in the sense that the end point is one step 


away from the starting point. The Peano curve of order 3, and 
the Moore curve of order 4, are shown in Figure 16-13. Moore's 
curve has an irregularity in that the order 1 curve is up-right-down, 
but this shape does not appear in the higher-order 
curves. Except for this minor exception, the algorithms for 
dealing with Moore's curve are very similar to those for the 
Hilbert curve. 




FIGURE 16-13. Peano (left) and Moore (right) curves. 


The Hilbert curve has been generalized to arbitrary rectangles 
and to three and higher dimensions. The basic building block for 
a three-dimensional Hilbert curve is shown below. It hits all 
eight points of a 2×2×2 cube. These and many other space- 
filling curves are discussed in [Sagan]. 


16-7 Applications 


Space-filling curves have applications in image processing: 
compression, halftoning, and textural analysis [L&S]. Another 
application is to improve computer performance in ray tracing, a 
graphics-rendering technique. Conventionally, a scene is scanned 
by projecting rays across the scene in ordinary raster scan line 
order (left to right across the screen, and then top to bottom). 


When a ray hits an object in the simulated scene's database, the 
color and other properties of the object at that point are 
determined, and the results are used to illuminate the pixel 
through which the ray was sent. (This is an oversimplification, 
but it’s adequate for our purposes.) One problem is that the 
database is often large and the data on each object must be 
paged in and cast out as various objects are hit by the scanning 
ray. When the ray scans across a line, it often hits many objects 
that were hit in the previous scan, requiring them to be paged in 
again. Paging operations would be reduced if the scanning had 
some kind of locality property. For example, it might be helpful 
to scan a quadrant of the screen completely before going on to 
another quadrant. 


The Hilbert curve seems to have the locality property we are 
seeking. It scans a quadrant completely before scanning another, 
recursively, and also does not make a long jump when going 
from one quadrant to another. 


Douglas Voorhies [Voor] has simulated what the paging 
behavior would likely be for the conventional uni-directional 
scan line traversal, the Peano curve, and the Hilbert curve. His 
method is to scatter circles of a given size randomly on the 
screen. A scan path hitting a circle represents touching a new 
object, and paging it in. When a scan leaves a circle, it is 
presumed that the object's data remains in memory until the 
scan exits a circle of radius twice that of the "object" circle. 
Thus, if the scan leaves the object for just a short distance and 
then returns to it, it is assumed that no paging operation 
occurred. He repeats this experiment for many different sizes of 
circles, on a simulated 1024 x 1024 screen. 


Assume that entering an object circle and leaving its 
surrounding circle represent one paging operation. Then, clearly 
the normal scan line causes D paging operations in covering a 
(not too big) circle of diameter D pixels, because each scan line 
that enters it leaves its outer circle. The interesting result of 
Voorhies's simulation is that for the Peano curve, the number of 
paging operations to scan a circle is about 2.7 and, perhaps 
surprisingly, is independent of the circle's diameter. For the 
Hilbert curve, the figure is about 1.4, also independent of the 
circle's diameter. Thus, the experiment suggests that the Hilbert 
curve is superior to the Peano curve, and vastly superior to the 


normal scan line path, in reducing paging operations. (The result 
that the page count is independent of the circles’ diameters is 
probably an artifact of the outer circle's being proportional in 
size to the object circle.) 


The Hilbert curve has been used to assign jobs to processors 
when the processors are interconnected in a rectangular 2D or 
3D grid [Cplant]. The processor allocation system software uses 
a linear list of the processors that follows a Hilbert curve over 
the grid. When a job that requires a number of processors is 
scheduled to run, the allocator allocates them from the linear 
list, much as a memory allocator would do. The allocated 
processors tend to be close together on the grid, which leads to 
good intercommunication properties. 


Exercises 


1. A simple way to cover an n × n grid in a way that doesn't 
make too many big jumps, and hits every point once and 
only once, is to have a 2n-bit variable s that is 
incremented at each step, and form x from the first and 
every other bit of s, and y from the second and every 
other bit of s. This is equivalent to computing the perfect 
outer unshuffle of s, and then letting x and y be the left 
and right halves of the result. Investigate this curve's 
locality property by sketching the curve for n = 3. 


2. A variation of exercise 1 is to first transform s into Gray(s) 
(see page 312), and then let x and y be formed from 
every other bit of the result, as in exercise 1. Sketch the 
curve for n = 3. Has this improved the locality property? 


3. How would you construct a three-dimensional analog of 
the curve of exercise 1? 


Chapter 17. Floating-Point 


God created the integers, 
all else is the work of man. 


Leopold Kronecker 


Operating on floating-point numbers with integer arithmetic and 
logical instructions is often a messy proposition. This is 
particularly true for the rules and formats of the IEEE Standard 
for Floating-Point Arithmetic, IEEE Std. 754-2008, commonly 
known as "IEEE arithmetic." It has the NaN (not a number) and 
infinities, which are special cases for almost all operations. It has 
plus and minus zero, which must compare equal to one another. 
It has a fourth comparison result, “unordered.” The most 
significant bit of the fraction is not explicitly present in “normal” 
numbers, but it is in “subnormal” numbers. The fraction is in 
signed-true form and the exponent is in biased form, whereas 
integers are now almost universally in two's-complement form. 
There are, of course, reasons for all this, but it results in 
programs that deal with the representation being full of tests and 
branches, and that present a challenge to implement efficiently. 


We assume the reader has some familiarity with the IEEE 
standard, and summarize it here only very briefly. 


17-1 IEEE Format 


The 2008 standard includes three binary and two decimal 
formats. We will restrict our attention to the binary "single" and 
“double” formats (32- and 64-bit). These are shown below. 


Single format:   | s (1 bit) | e (8 bits)  | f (23 bits) |
Double format:   | s (1 bit) | e (11 bits) | f (52 bits) |

The sign bit s is encoded as 0 for plus, 1 for minus. The 
biased exponent e and fraction f are magnitudes with their most 
significant bits on the left. The floating-point value represented 
is encoded as shown on the next page. 


Single format                              Double format
e          f      value                    e           f      value
0          0      ±0                       0           0      ±0
0          ≠ 0    $\pm 2^{-126}(0.f)$      0           ≠ 0    $\pm 2^{-1022}(0.f)$
1 to 254   any    $\pm 2^{e-127}(1.f)$     1 to 2046   any    $\pm 2^{e-1023}(1.f)$
255        0      ±∞                       2047        0      ±∞
255        ≠ 0    NaN                      2047        ≠ 0    NaN


As an example, consider encoding the number π in single 
format. In binary [Knu1], 


π = 11.0010 0100 0011 1111 0110 1010 1000 1000 1000 
0101 1010 0011 0000 10... 


This is in the range of the “normal” numbers shown in the 
third row of the table above. The most significant 1 in π is 
dropped, as the leading 1 is not stored in the encoding of normal 
numbers. The exponent e − 127 should be 1, to get the binary 
point in the right place, and hence e = 128. Thus, the 
representation is 




0 10000000 10010010000111111011011 


or, in hexadecimal, 
40490FDB, 


where we have rounded the fraction to the nearest representable 
number. 
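
The encoding of π worked out above can be checked mechanically. The following C sketch is illustrative only: the helper names float_bits, biased_exponent, and fraction_field are assumptions of this example, and it presumes that the C type float is an IEEE single. It copies a float into a 32-bit integer and separates the exponent and fraction fields.

```c
#include <stdint.h>
#include <string.h>

/* Return the 32-bit encoding of f.
   Assumes float is an IEEE 754 single (true on most current machines). */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* Well-defined way to reinterpret the bits. */
    return u;
}

/* Biased exponent e: the 8 bits just below the sign bit. */
uint32_t biased_exponent(uint32_t bits) {
    return (bits >> 23) & 0xFF;
}

/* Fraction f: the low-order 23 bits (the leading 1 of a normal
   number is not stored). */
uint32_t fraction_field(uint32_t bits) {
    return bits & 0x007FFFFF;
}
```

Under these assumptions, float_bits(3.14159265f) should give 0x40490FDB, with a biased exponent of 128 and a fraction field of 0x490FDB.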


Numbers with 1 ≤ e ≤ 254 are the “normal numbers.” These 
are “normalized,” meaning that their most significant bit is 1 
and it is not explicitly stored. Nonzero numbers with e = 0 are 
called “subnormal numbers,” or simply “subnormals.” Their 
most significant bit is explicitly stored. This scheme is sometimes 
called “gradual underflow.” Some extreme values in the various 
ranges of floating-point numbers are shown in Table 17-1. In 
this table, “Max integer” means the largest integer such that all 
integers less than or equal to it, in absolute value, are 
representable exactly; the next integer is rounded. 


For normal numbers, one unit in the last position (ulp) has a 
relative value ranging from $1/2^{24}$ to $1/2^{23}$ (about 
$5.96 \times 10^{-8}$ to $1.19 \times 10^{-7}$) for single format, 
and from $1/2^{53}$ to $1/2^{52}$ (about $1.11 \times 10^{-16}$ to 
$2.22 \times 10^{-16}$) for double format. The maximum “relative 
error,” for round to nearest mode, is half of those figures. 


The range of integers that is represented exactly is from $-2^{24}$ 
to $+2^{24}$ (−16,777,216 to +16,777,216) for single format, and 
from $-2^{53}$ to $+2^{53}$ (−9,007,199,254,740,992 to 
+9,007,199,254,740,992) for double format. Of course, certain 
integers outside these ranges, such as larger powers of 2, can be 
represented exactly; the ranges cited are the maximal ranges for 
which all integers are represented exactly. 
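
The single-format limit can be observed directly. The sketch below is a hedged illustration, not part of the original text: the function name exact_in_float is hypothetical, and it assumes that float is an IEEE single and that conversions round to nearest. It converts a 32-bit integer to float and back and reports whether the value survived.

```c
#include <stdint.h>

/* Nonzero if the integer n is represented exactly in single format. */
int exact_in_float(int32_t n) {
    /* volatile forces the value to be held as a true single, even on
       machines that keep intermediate results in wider registers. */
    volatile float f = (float)n;
    /* Compare in 64 bits so that a result of 2^31 cannot overflow. */
    return (int64_t)f == (int64_t)n;
}
```

With these assumptions, exact_in_float(16777216) is nonzero, whereas exact_in_float(16777217) is zero, because $2^{24} + 1$ needs 25 significant bits and is rounded. The even number 16777218 is again exact.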


TABLE 17-1. EXTREME VALUES 


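Single Precision 

                     Hex                   Exact Value                  Approximate Value
Smallest subnormal   0000 0001             $2^{-149}$                   $1.401 \times 10^{-45}$
Largest subnormal    007F FFFF             $2^{-126}(1 - 2^{-23})$      $1.175 \times 10^{-38}$
Smallest normal      0080 0000             $2^{-126}$                   $1.175 \times 10^{-38}$
1.0                  3F80 0000             1                            1
Max integer          4B80 0000             $2^{24}$                     $1.677 \times 10^{7}$
Largest normal       7F7F FFFF             $2^{128}(1 - 2^{-24})$       $3.403 \times 10^{38}$
∞                    7F80 0000             ∞                            ∞


Double Precision 

                     Hex                   Exact Value                  Approximate Value
Smallest subnormal   0000 0000 0000 0001   $2^{-1074}$                  $4.941 \times 10^{-324}$
Largest subnormal    000F FFFF FFFF FFFF   $2^{-1022}(1 - 2^{-52})$     $2.225 \times 10^{-308}$
Smallest normal      0010 0000 0000 0000   $2^{-1022}$                  $2.225 \times 10^{-308}$
1.0                  3FF0 0000 0000 0000   1                            1
Max integer          4340 0000 0000 0000   $2^{53}$                     $9.007 \times 10^{15}$
Largest normal       7FEF FFFF FFFF FFFF   $2^{1024}(1 - 2^{-53})$      $1.798 \times 10^{308}$
∞                    7FF0 0000 0000 0000   ∞                            ∞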


One might want to change division by a constant to 
multiplication by the reciprocal. This can be done with complete 
(IEEE) accuracy only for numbers whose reciprocals are 
represented exactly. These are the powers of 2 from $2^{-127}$ to 
$2^{127}$ for single format, and from $2^{-1023}$ to $2^{1023}$ for double 
format. The numbers $2^{-127}$ and $2^{-1023}$ are subnormal numbers, 
which are best avoided on machines that implement operations 
on subnormal numbers inefficiently. 


17-2 Floating-Point To/From Integer Conversions 


Table 17-2 gives some formulas for conversion between IEEE 
floating-point format and integers. These methods are concise 
and fast, but they do not give the correct result for the full range 
of input values. The ranges over which they do give the 
precisely correct result are given in the table. They all give the 
correct result for +0.0 and for subnormals within the stated 
ranges. Most do not give a reasonable result for a NaN or 
infinity. These formulas may be suitable for direct use in some 
applications, or in a library routine to get the common cases 
quickly. 


TABLE 17-2. FLOATING-POINT CONVERSIONS 


Range 


Double to 
10164, n 


1 
Double to м | (x1ce50- сә -25 to 251+] 
int64, u Е 


Double to if (x > 0.0) 
int64, z 


1 
(x 655) – Cs; 
else 
€5; — (65: х) 


Double to хс) —Ce —0.25 to 252 
uint64, n . . 
Double to = 5 
uint64, d 


Double to —231 — 0.5 to 23! — 0.5 — шр 
int32 or 


uint32, n or —0.5 to 232 — 0.5 — ulp 


Double to 


int32 or 7 
uint32, d 0 to 222 — ulp 


—23! to 2?! — ulp, or 


Double to u | low32(x σοι) -23-144 рїо23-1,ог | 1 
int32 or 
uint32, u 


Double to d | if (x» 0.0) —231 — 1+ шр to 2?! — ulp, 
int32 or ог [| low32(x 16,4) 32 
uint32, z 2 | else ив or —T*ulp to 2^" —ulp 
—low32(¢5>, “ x) 
or 
z 


—1+ulp to 232-1 


Float to (x 165) — € -22 to 222+ 0.5 
int32, n 
Float to d (x 1€934) – ©з 
int32, d 
Float to (x {ез )— €» -222 {0 222 + | 
int32, u 


Float to d | if (x 2 0.0) —223 to 223 
SESS | 
int32, z (x с) - 65 
else 
C23 — (055 +x) 


Float to (x fcz) — οι -0.25 to 23 
uint32, n 

Float to (x 163) - ο 0 to 22 
uint32, d 


d 
(x $655) 2.655 —25! to 231 + 0.5 


uint32, u 


Round 
double to 
nearest 


Round non- (x 165) Ч co -0.25 to 252 
negative 


22 


Round float 
to nearest 


double to 


nearest 
Round non- —0.25 to 223 


negative 
float to 
nearest 


Int64 to 
double 


ә 
221 


Γ᾽ 


Uint64 to - | (х+е;,) cs; 0 to 222 — 1 4 
double 

Int32 to (X + суу) 56» -222 to 222 

float 


Uint32 to (x + Сэ) 5635 0 to 223 
float 


Constants: 


$c_{52}$ = 0x43300000 00000000 = $2^{52}$ 
$c_{2322}$ = 0x4B400000 = $2^{23} + 2^{22}$ 
$c_{23}$ = 0x4B000000 = $2^{23}$ 


Notes: 
1. The floating-point operations must be done in IEEE double-precision (53 bits 
   of precision) and no more. Most Intel machines do not, by default, operate in 
   this mode. On those machines it is necessary to set the precision (PC field in 
   the FPU Control Word) to double-precision. 
2. The floating-point operations must be done in IEEE single-precision (24 bits of 
   precision) and no more. Most Intel machines are not, by default, operated in 
   this mode. On those machines it is necessary to set the precision (PC field in 
   the FPU Control Word) to single-precision. 
3. “Nonnegative” means −0.0 or greater than or equal to 0.0. 
4. To convert a 32-bit signed or unsigned integer to double, sign- or zero-extend 
   the 32-bit integer to 64 bits and use the appropriate one of these formulas. 


The Type column denotes the type of conversion desired, 
including the rounding mode: n for round to nearest even, d for 
round down, u for round up, and z for round toward zero. The R 
column denotes the rounding mode that the machine must be in 
for the formula to give the correct result. (On some machines, 
such as the Intel IA-32, the rounding mode can be specified in 
the instruction itself, rather than in a “mode” register.) 


A "double" is an IEEE double, which is 64 bits in length. A 
"float" is an IEEE single, which is 32 bits in length. 


The notation “ulp” means one unit in the last position. For 
example, 1.0 - ulp denotes the IEEE-format number that is 
closest to 1.0 but less than 1.0, something like 0.99999.... The 
notation “int64” denotes a signed 64-bit integer (two's- 
complement), and “int32” denotes a signed 32-bit integer. 
“uint64” and “uint32” have similar meanings, but for unsigned 
interpretations. 


The function low32(x) extracts the low-order 32 bits of x. 


The operators $\overset{d}{+}$ and $\overset{s}{+}$ denote double- and 
single-precision floating-point addition, respectively. Similarly, 
the operators $\overset{d}{-}$ and $\overset{s}{-}$ denote double- and 
single-precision subtraction. 
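
The add-a-constant technique behind these conversions can be sketched in C. This is a minimal sketch under stated assumptions, not a transcription of any table entry: the function name double_to_int64_nearest is hypothetical, the constant used is $2^{52} + 2^{51}$ (encoding 0x43380000 00000000), the addition is assumed to be performed in IEEE double precision with the round-to-nearest mode in effect, and the argument is assumed to be no larger than about $2^{51}$ in magnitude.

```c
#include <stdint.h>
#include <string.h>

/* Round x to the nearest integer (ties to even) and return it as a
   64-bit integer. Not valid for large |x|, NaNs, or infinities. */
int64_t double_to_int64_nearest(double x) {
    const double  c      = 6755399441055744.0;     /* 2^52 + 2^51      */
    const int64_t c_bits = 0x4338000000000000LL;   /* encoding of c    */

    /* After the add, one ulp of the sum is 1, so the hardware has
       rounded x to an integer that sits in the low fraction bits.
       volatile keeps the sum in true double precision. */
    volatile double sum = x + c;
    double t = sum;

    int64_t bits;
    memcpy(&bits, &t, sizeof bits);
    return bits - c_bits;     /* Integer subtract removes the constant. */
}
```

For example, an argument of 2.5 should produce 2 and an argument of 3.5 should produce 4, reflecting round to nearest even.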


It might seem curious that on most Intel machines the double 
to integer (of any size) conversions require that the machine's 
precision mode be reduced to 53 bits, whereas for float to 
integer conversions, the reduction in precision is not necessary— 
the correct result is obtained with the machine running in 
extended-precision mode (64 bits of precision). This is because 
for the double-precision add of the constant, the fraction might 
be shifted right as many as 52 bits, which may cause 1-bits to be 
shifted beyond the 64-bit limit, and hence lost. Thus, two 
roundings occur—first to 64 bits and then to 53 bits. On the 
other hand, for the single-precision add of the constant, the 
maximum shift is 23 bits. With that small shift amount, no bit 
can be shifted beyond the 64-bit boundary, so that only one 
rounding operation occurs. The conversions from float to integer 
get the correct result on Intel machines in all three precision 
modes. 


On Intel machines running in extended-precision mode, the 
conversions from double to int64 and uint64 can be done 
without changing the precision mode by using different 
constants and one more floating-point operation. The calculation 
is 

$((x \overset{e}{+} c_1) \overset{e}{-} c_2) - c_3$, 

where $\overset{e}{+}$ and $\overset{e}{-}$ denote extended-precision 
addition and subtraction, respectively. (The result of the add must 
remain in the 80-bit register for use by the extended-precision 
subtract operation.) 


For double to int64, 


$c_1$ = 0x43E00300 00000000 = $2^{63} + 2^{52} + 2^{51}$ 
$c_2$ = 0x43E00000 00000000 = $2^{63}$ 
$c_3$ = 0x43380000 00000000 = $2^{52} + 2^{51}$. 


For double to uint64, 


$c_1$ = 0x43E00200 00000000 = $2^{63} + 2^{52}$ 
$c_2$ = 0x43E00000 00000000 = $2^{63}$ 
$c_3$ = 0x43300000 00000000 = $2^{52}$. 


Using these constants, similar expressions can be derived for 
the conversion and rounding operations shown in Table 17-2 
that are flagged by Note 1. The ranges of applicability are close 
to those shown in the table. 


However, for the round double to nearest operation, if the 
calculation subtracts first and then adds, that is, 

$((x \overset{e}{-} c_1) \overset{e}{+} c_2) + c_3$ 

(using the first set of constants above), then the range for which 
the correct result is obtained is $-2^{51} - 0.5$ to ∞, but not a NaN. 


17-3 Comparing Floating-Point Numbers Using 
Integer Operations 

One of the features of the IEEE encodings is that non-NaN values 
are properly ordered if treated as signed magnitude integers. 


To program a floating-point comparison using integer 
operations, it is necessary that the “unordered” result not be 
needed. In IEEE 754, the unordered result occurs when one or 
both comparands are NaNs. The methods below treat NaNs as if 
they were numbers greater in magnitude than infinity. 


The comparisons are also much simpler if -0.0 can be treated 
as strictly less than +0.0 (which is not in accordance with IEEE 
754). Assuming this is acceptable, the comparisons can be done 


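as shown below, where $\overset{f}{=}$, $\overset{f}{<}$, and 
$\overset{f}{\le}$ denote floating-point comparisons, $\overset{u}{>}$ and 
$\overset{u}{\ge}$ denote unsigned integer comparisons, and the ≈ symbol 
is used as a reminder that these formulas do not treat ±0.0 quite 
right. These comparisons are the same as IEEE 754-2008's 
“total-ordering” predicate. 


$a \overset{f}{=} b \approx (a = b)$ 
$a \overset{f}{<} b \approx (a \ge 0 \mathbin{\&} a < b) \mid (a < 0 \mathbin{\&} a \overset{u}{>} b)$ 
$a \overset{f}{\le} b \approx (a \ge 0 \mathbin{\&} a \le b) \mid (a < 0 \mathbin{\&} a \overset{u}{\ge} b)$ 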

If -0.0 must be treated as equal to +0.0, there does not seem 
to be any slick way to do it, but the following formulas, which 
follow more or less obviously from the above, are possibilities. 

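$a \overset{f}{=} b \equiv (a = b) \mid (-a = a \mathbin{\&} -b = b)$ 
$\qquad \equiv (a = b) \mid ((a \mid b) = \mathtt{0x80000000})$ 
$\qquad \equiv (a = b) \mid (((a \mid b) \mathbin{\&} \mathtt{0x7FFFFFFF}) = 0)$ 
$a \overset{f}{<} b \equiv ((a \ge 0 \mathbin{\&} a < b) \mid (a < 0 \mathbin{\&} a \overset{u}{>} b)) \mathbin{\&} ((a \mid b) \ne \mathtt{0x80000000})$ 
$a \overset{f}{\le} b \equiv (a \ge 0 \mathbin{\&} a \le b) \mid (a < 0 \mathbin{\&} a \overset{u}{\ge} b) \mid ((a \mid b) = \mathtt{0x80000000})$ 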
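
The formulas that treat −0.0 as equal to +0.0 translate into C fairly directly. The following is a sketch under assumptions: the names bits_of, float_less, and float_equal are hypothetical, float is taken to be an IEEE single, and, as in the text, NaNs are simply treated as values beyond infinity. Logical && and || are used in place of the bitwise operators of the formulas, which gives the same truth values.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the encoding of f as a signed 32-bit integer. */
static int32_t bits_of(float f) {
    int32_t n;
    memcpy(&n, &f, sizeof n);
    return n;
}

/* Floating-point "less than" computed with integer operations only. */
int float_less(float fa, float fb) {
    int32_t  a  = bits_of(fa), b = bits_of(fb);
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    /* Nonnegative a: signed compare. Negative a: unsigned compare,
       because a larger magnitude means a smaller value. */
    int lt = (a >= 0 && a < b) || (a < 0 && ua > ub);
    /* Exclude the case in which the operands are -0.0 and +0.0. */
    return lt && ((ua | ub) != 0x80000000u);
}

/* Floating-point equality computed with integer operations only. */
int float_equal(float fa, float fb) {
    int32_t  a  = bits_of(fa), b = bits_of(fb);
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    return a == b || ((ua | ub) & 0x7FFFFFFFu) == 0;
}
```

As a usage example, float_less(-2.0f, -1.0f) is nonzero, float_less(-0.0f, 0.0f) is zero, and float_equal(-0.0f, 0.0f) is nonzero.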


In some applications, it might be more efficient to first 
transform the numbers in some way, and then do a floating- 
point comparison with a single fixed-point comparison 
instruction. For example, in sorting n numbers, the 
transformation would be done only once to each number, 
whereas a comparison must be done at least $\lceil n \log_2 n \rceil$ 
times (in the minimax sense). 

Table 17-3 gives four such transformations. For those in the 
left column, -0.0 compares equal to +0.0, and for those in the 
right column, -0.0 compares less than +0.0. In all cases, the 
sense of the comparison is not altered by the transformation. 
Variable n is signed, t is unsigned, and c may be either signed 
or unsigned. 


The last row shows branch-free code that can be 
implemented on our basic RISC in four instructions for the left 
column, and three for the right column (these four or three 
instructions must be executed for each comparand). 


TABLE 17-3. PRECONDITIONING FLOATING-POINT NUMBERS FOR 
INTEGER COMPARISONS 


−0.0 = +0.0 (IEEE)                     −0.0 < +0.0 (non-IEEE)

if (n >= 0) n = n + 0x80000000;        if (n >= 0) n = n + 0x80000000;
else n = -n;                           else n = ~n;
Use unsigned comparison.               Use unsigned comparison.

c = 0x7FFFFFFF;                        c = 0x7FFFFFFF;
if (n < 0) n = (n ^ c) + 1;            if (n < 0) n = n ^ c;
Use signed comparison.                 Use signed comparison.

c = 0x80000000;                        c = 0x7FFFFFFF;
if (n < 0) n = c - n;                  if (n < 0) n = c - n;
Use signed comparison.                 Use signed comparison.

t = n >> 31;                           t = (unsigned)(n >> 30) >> 1;
n = (n ^ (t >> 1)) - t;                n = n ^ t;
Use signed comparison.                 Use signed comparison.
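
As an illustration of how such a transformation might be used to build sort keys, here is a hedged C sketch of the branch-free variant in which −0.0 and +0.0 compare equal. The name float_sort_key is hypothetical. To stay within well-defined C, the mask t is formed from an unsigned shift and a negation rather than from a signed right shift, which is a substitution made for this example, and the arithmetic is done in unsigned integers and copied into a signed result.

```c
#include <stdint.h>
#include <string.h>

/* Map a float to a signed integer key so that signed integer
   comparison of keys agrees with floating-point comparison.
   -0.0 and +0.0 receive the same key. NaNs are not handled. */
int32_t float_sort_key(float f) {
    uint32_t n;
    memcpy(&n, &f, sizeof n);

    uint32_t t = 0u - (n >> 31);       /* 0 if sign clear, else all 1s.  */
    uint32_t k = (n ^ (t >> 1)) - t;   /* Negatives: flip the low 31
                                          bits and add 1, giving minus
                                          the magnitude.                 */
    int32_t key;
    memcpy(&key, &k, sizeof key);
    return key;
}
```

Each number is transformed once, after which an ordinary integer sort on the keys orders the original floating-point values.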


17-4 An Approximate Reciprocal Square Root 
Routine 


In the early 2000s, there was some buzz in programming circles 
about an amazing routine for computing an approximation to 
the reciprocal square root of a number in IEEE single format. 
The routine is useful in graphics applications, for example, to 


normalize a vector by multiplying its components x, y, and z by 
$1/\sqrt{x^2 + y^2 + z^2}$. C code for the function is shown in 
Figure 17-1 [Taro]. 

The relative error of the result is in the range 0 to -0.00176 
for all normal single-precision numbers (it errs on the low side). 
It gives the correct IEEE result (NaN) if its argument is a NaN. 
However, it gives an unreasonable result if its argument is ±∞, 
a negative number, or -0. If the argument is +0 or a positive 
subnormal, the result is not what it should be, but it is a large 
number (greater than $9 \times 10^{18}$), which might be acceptable in 
some applications. 


The relative error can be reduced in magnitude, to the range 
±0.000892, by changing the constant 1.5 in the Newton step to 
1.5008908. 


Another possible refinement is to replace the multiplication 
by 0.5 with a subtract of 1 from the exponent of x. That is, 
replace the definition of xhalf with 




union {int ihalf; float xhalf;}; 
ihalf = ix - 0x00800000; 


However, the function then gives inaccurate results (although 
greater than $6 \times 10^{18}$) for x a normal number less than about 
$2.34 \times 10^{-38}$, and NaN for x a subnormal number. For x = 0 
the result is ±∞ (which is correct). 


The Newton step is a standard Newton-Raphson calculation 
for the reciprocal square root function (see Appendix B). Simply 
repeating this step reduces the relative error to the range 0 to 
-0.0000047. The optimal constant for this is 0x5F37599E. 


On the other hand, deleting the Newton step results in a 
substantially faster function with a relative error within ±0.035, 
using a constant of 0x5F37642F. It consists of only two integer 
instructions, plus code to load the constant. (The variable xhalf 
can be deleted.) 


float rsqrt(float x0) { 
   union {int ix; float x;}; 

   x = x0;                      // x can be viewed as int. 
   float xhalf = 0.5f*x; 
   ix = 0x5f375a82 - (ix >> 1); // Initial guess. 
   x = x*(1.5f - xhalf*x*x);    // Newton step. 
   return x; 
} 


FIGURE 17-1. Approximate reciprocal square root. 
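
For compilers that do not accept an anonymous union inside a function, the same calculation can be written with memcpy. The following is a sketch of such a variant, not the original routine: the name rsqrt_portable is hypothetical, the integer is held as an unsigned 32-bit value (which makes no difference for positive arguments), and float is assumed to be an IEEE single.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x0) for positive normal x0, using the same
   constant and Newton step as the routine in Figure 17-1. */
float rsqrt_portable(float x0) {
    float    x     = x0;
    float    xhalf = 0.5f * x;
    uint32_t ix;

    memcpy(&ix, &x, sizeof ix);        /* View x as an integer.   */
    ix = 0x5F375A82u - (ix >> 1);      /* Initial guess.          */
    memcpy(&x, &ix, sizeof x);         /* View the guess as float. */

    x = x * (1.5f - xhalf * x * x);    /* Newton step.            */
    return x;
}
```

For an argument of 4.0 the result should be slightly below 0.5, consistent with the error range quoted for the routine.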


To get an inkling of why this works, suppose $x = 2^n(1 + f)$, 
where n is the unbiased exponent and f is the fraction 
($0 \le f < 1$). Then 

$\frac{1}{\sqrt{x}} = 2^{-n/2}(1 + f)^{-1/2}.$ 

Ignoring the fraction, this shows that we must change the biased 
exponent from 127 + n to 127 − n/2. If e = 127 + n, then 
127 − n/2 = 127 − (e − 127)/2 = 190.5 − e/2. Therefore, it appears 
that a calculation something like shifting x right one position 
and subtracting it from 190 in the exponent position, might give 
a very rough approximation to $1/\sqrt{x}$. In C, this can be 
expressed as 


   union {int ix; float x;};   // Make ix and x overlap. 
   0x5F000000 - (ix >> 1);     // Refer to x as integer ix. 


To find a better value for the constant 0x5F000000 by 
analysis is difficult. Four cases must be analyzed: the cases in 
which a 0-bit or a 1-bit is shifted from the exponent field to the 
fraction field, and the cases in which the subtraction does or 
does not generate a borrow that propagates to the exponent 
field. This analysis is done in [Lomo]. Here, we make some 
simple observations. 


Using rep(x) to denote the representation of the floating- 
point number x in IEEE single format, we want a formula of the 
form 


$\mathrm{rep}(1/\sqrt{x}) = k - (\mathrm{rep}(x) >> 1)$ 


for some constant k. (Whether the shift is signed or unsigned 
makes no difference, because we exclude negative values of x 
and -0.0.) We can get an idea of roughly what k should be from 


$k = \mathrm{rep}(1/\sqrt{x}) + (\mathrm{rep}(x) >> 1)$ 

and trying a few values of x. The results are shown in Table 17-4 
(in hexadecimal). 


It looks like k is approximately a constant. Notice that the 
same value is obtained for x = 1.0 and 4.0. In fact, the same 
value of k results from any number x and 4x (provided they are 
both normal numbers). This is because, in the formula for k, if x 
is quadrupled, then the term $\mathrm{rep}(1/\sqrt{x})$ decreases by 1 
in the exponent field, and the term $\mathrm{rep}(x) >> 1$ increases 
by 1 in the exponent field. 

More significantly, the relative errors for x and 4x are exactly 
the same, provided both quantities are normal numbers. To see 
this, it can be shown that if the argument x of the rsqrt 
function is quadrupled, the result of the function is exactly 
halved, and this is true no matter how many Newton steps are 


done. Of course, $1/\sqrt{x}$ is also halved. Therefore, the relative 
error is unchanged. 


TABLE 17-4. DETERMINING THE CONSTANT 


Trial x    rep(x)      rep(1/√x)   k
1.0        3F800000    3F800000    5F400000
1.5        3FC00000    3F5105EC    5F3105EC
2.0        40000000    3F3504F3    5F3504F3
2.5        40200000    3F21E89B    5F31E89B
3.0        40400000    3F13CD3A    5F33CD3A
3.5        40600000    3F08D677    5F38D677
4.0        40800000    3F000000    5F400000
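
An entry of Table 17-4 can be reproduced with a few lines of C. In this sketch, which is illustrative only, the names rep and trial_k are assumptions, and the caller supplies the value of $1/\sqrt{x}$ so that the example needs no math library.

```c
#include <stdint.h>
#include <string.h>

/* rep(x): the IEEE single-format encoding of x as an integer. */
static uint32_t rep(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Value of k implied by a trial x, where y is 1/sqrt(x) rounded to
   single format: k = rep(y) + (rep(x) >> 1). */
uint32_t trial_k(float x, float y) {
    return rep(y) + (rep(x) >> 1);
}
```

For instance, trial_k(1.0f, 1.0f) and trial_k(4.0f, 0.5f) both give 0x5F400000, matching the first and last rows of the table and illustrating that x and 4x yield the same k.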


This is important, because it means that if we find an optimal 
value (by some criterion, such as minimizing the maximum 
absolute value of the error) for values of x in the range 1.0 to 
4.0, then the same value of k is optimal for all normal numbers. 


It is then a straightforward task to write a program that, for a 
given value of k, calculates the true value of $1/\sqrt{x}$ (using a 
known accurate library routine) and the estimated value for 
some 10,000 or so values of x from 1.0 to 4.0, and calculates the 
maximum error. The optimal value of k can be determined by 
hand, which is tedious but sometimes illuminating. It is quite 
amazing that there is a constant for which the error is less than 
±3.5% in a function that uses only two integer operations and 
no table lookup. 
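
A program of the kind just described might look like the following C sketch. The names rsqrt_estimate and max_relative_error are hypothetical. To keep the sketch free of library dependencies, the reference value of $1/\sqrt{x}$ is obtained by refining the estimate with Newton steps in double precision, a substitution for the accurate library routine mentioned above.

```c
#include <stdint.h>
#include <string.h>

/* Two-operation estimate of 1/sqrt(x): k - (rep(x) >> 1). */
static float rsqrt_estimate(float x, uint32_t k) {
    uint32_t ix;
    memcpy(&ix, &x, sizeof ix);
    ix = k - (ix >> 1);
    memcpy(&x, &ix, sizeof x);
    return x;
}

/* Largest magnitude of the relative error of the estimate over
   `samples` evenly spaced values of x in the range 1.0 to 4.0. */
double max_relative_error(uint32_t k, int samples) {
    double worst = 0.0;
    for (int i = 0; i < samples; i++) {
        float  x   = 1.0f + 3.0f * (float)i / (float)samples;
        double est = rsqrt_estimate(x, k);

        /* Reference value: Newton steps in double precision, which
           converge quadratically from the rough starting guess. */
        double ref = est;
        for (int j = 0; j < 8; j++)
            ref = ref * (1.5 - 0.5 * (double)x * ref * ref);

        double err = (est - ref) / ref;
        if (err < 0.0) err = -err;
        if (err > worst) worst = err;
    }
    return worst;
}
```

Calling max_relative_error(0x5F37642F, 10000) should report a value a little under 0.035, whereas the rough constant 0x5F000000 gives an error of more than 20 percent. Trying neighboring values of k shows how the maximum error varies with the constant.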


17-5 The Distribution of Leading Digits 


When IBM introduced the System/360 computer in 1964, 
numerical analysts were horrified at the loss of precision of 
single-precision arithmetic. The previous IBM computer line, the 
704 - 709 - 7090 family, had a 36-bit word. For single-precision 
floating-point, the format consisted of a 9-bit sign and exponent 
field, followed by a 27-bit fraction in binary. The most 
significant fraction bit was explicitly included (in “normal” 
numbers), so quantities were represented with a precision of 27 
bits. 


The S/360 has a 32-bit word. For single-precision, IBM chose 
to have an 8-bit sign and exponent field followed by a 24-bit 
fraction. This drop from 27 to 24 bits was bad enough, but it 
gets worse. To keep the exponent range large, a unit in the 7-bit 
exponent of the S/360 format represents a factor of 16. Thus, 
the fraction is in base 16, and this format came to be called 
“hexadecimal” floating-point. The leading digit can be any 
number from 1 to 15 (binary 0001 to 1111). Numbers with 
leading digit 1 have only 21 bits of precision (because of the 
three leading 0’s), but they should constitute only 1/15 (6.7%) 
of all numbers. 


No, it's worse than that! There was a flurry of activity to 
show, both analytically and empirically, that leading digits are 
not uniformly distributed. In hexadecimal floating-point, one 
would expect 25% of the numbers to have leading digit 1, and 
hence only 21 bits of precision. 


Let us consider the distribution of leading digits in decimal. 
Suppose you have a large set of numbers with units, such as 
length, volume, mass, speed, and so on, expressed in "scientific" 
notation (e.g., $6.022 \times 10^{23}$). If the leading digit of a large 


number of such numbers has a well-defined distribution 
function, then it must be independent of the units—whether 
inches or centimeters, pounds or kilograms, and so on. Thus, if 
you multiply all the numbers in the set by any constant, the 
distribution of leading digits should be unchanged. For example, 
considering multiplying by 2, we conclude that the number of 
numbers with leading digit 1 (those from 1.0 to 1.999... times 10 
to some power) must equal the number of numbers with leading 
digit 2 or 3 (those from 2.0 to 3.999... times 10 to some power), 
because it shouldn't matter if our unit of length is inches or half 
inches, or our unit of mass is kilograms or half kilograms, and so 
on. 


Let f(x), for $1 \le x < 10$, be the probability density function 
for the leading digits of the set of numbers with units. f(x) has 
the property that 

$\int_a^b f(x)\,dx$ 

is the proportion of numbers that have leading digits ranging 
from a to b. Referring to the figure below, for a small increment 
Δx in x, f must satisfy 

$f(1) \cdot \Delta x = f(x) \cdot x\,\Delta x,$ 

[Figure: graph of f(x) with the points 1, x, x + xΔx, and 10 marked on the x-axis.] 


because $f(1) \cdot \Delta x$ is, approximately, the proportion of numbers 
ranging from 1 to 1 + Δx (ignoring a multiplier of a power of 
10), and $f(x) \cdot x\,\Delta x$ is the approximate proportion of numbers 
ranging from x to x + xΔx. Because the latter set is the first set 
multiplied by x, their proportions must be equal. Thus, the 
probability density function is a simple reciprocal relationship, 

$f(x) = f(1)/x.$ 

Because the area under the curve from x = 1 to x = 10 must 
be 1 (all numbers have leading digits from 1.000... to 9.999...), 
it is easily shown that 

$f(1) = 1/\ln 10.$ 


The proportion of numbers with leading digits in the range a 
to b, with $1 \le a \le b < 10$, is 

$\int_a^b \frac{dx}{x \ln 10} = \left. \frac{\ln x}{\ln 10} \right|_a^b = \frac{\ln (b/a)}{\ln 10} = \log_{10} \frac{b}{a}.$ 

Thus, in decimal, the proportion of numbers with leading digit 1 
is $\log_{10}(2/1) \approx 0.30103$, and the proportion of numbers with 
leading digit 9 is $\log_{10}(10/9) \approx 0.0458$. 


For base 16, the proportion of numbers with leading digits in 
the range a to b, with $1 \le a \le b < 16$, is similarly derived to be 
$\log_{16}(b/a)$. Hence, the proportion of numbers with leading 
digit 1 is $\log_{16}(2/1) = 1/\log_2 16 = 0.25$. 
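
The logarithmic law can be checked empirically on a sequence that is roughly scale invariant. The sketch below is an illustration under assumptions, not data from the text: it uses the successive powers of a fixed multiplier as the sample set, and the name leading_digit_fraction is hypothetical. Each value is kept normalized to the range from 1 up to the base, so that its integer part is its leading digit.

```c
/* Fraction of the first `count` powers of `mult` (starting with
   mult^0 = 1) whose leading digit in base `base` equals `digit`. */
double leading_digit_fraction(int base, double mult, int digit, int count) {
    double x = 1.0;     /* Invariant: 1 <= x < base. */
    int hits = 0;
    for (int i = 0; i < count; i++) {
        if ((int)x == digit)
            hits++;
        x *= mult;
        while (x >= base)   /* Discard the power of the base. */
            x /= base;
    }
    return (double)hits / count;
}
```

With 10,000 powers of 2 in base 10, the fraction having leading digit 1 should come out near 0.30, and with 10,000 powers of 3 in base 16 the fraction having leading digit 1 should come out near 0.25, in line with the formulas above.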


17-6 Table of Miscellaneous Values 


Table 17-5 shows the IEEE representation of miscellaneous 
values that may be of interest. The values that are not exact are 
rounded to the nearest representable value. 


TABLE 17-5. MISCELLANEOUS VALUES 


                              Single Format (Hex)   Double Format (Hex)

−∞                            FF80 0000             FFF0 0000 0000 0000
−2.0                          C000 0000             C000 0000 0000 0000
−1.0                          BF80 0000             BFF0 0000 0000 0000
−0.5                          BF00 0000             BFE0 0000 0000 0000
−0.0                          8000 0000             8000 0000 0000 0000
+0.0                          0000 0000             0000 0000 0000 0000
Smallest positive subnormal   0000 0001             0000 0000 0000 0001
Largest subnormal             007F FFFF             000F FFFF FFFF FFFF
Least positive normal         0080 0000             0010 0000 0000 0000
π/180 (0.01745...)            3C8E FA35             3F91 DF46 A252 9D39
0.1                           3DCC CCCD             3FB9 9999 9999 999A
log10 2 (0.3010...)           3E9A 209B             3FD3 4413 509F 79FF
1/e (0.3678...)               3EBC 5AB2             3FD7 8B56 362C EF38
1/ln 10 (0.4342...)           3EDE 5BD9             3FDB CB7B 1526 E50E
0.5                           3F00 0000             3FE0 0000 0000 0000
ln 2 (0.6931...)              3F31 7218             3FE6 2E42 FEFA 39EF
1/√2 (0.7071...)              3F35 04F3             3FE6 A09E 667F 3BCD
1/ln 3 (0.9102...)            3F69 0570             3FED 20AE 03BC C153
1.0                           3F80 0000             3FF0 0000 0000 0000
ln 3 (1.0986...)              3F8C 9F54             3FF1 93EA 7AAD 030B
√2 (1.414...)                 3FB5 04F3             3FF6 A09E 667F 3BCD
1/ln 2 (1.442...)             3FB8 AA3B             3FF7 1547 652B 82FE
√3 (1.732...)                 3FDD B3D7             3FFB B67A E858 4CAA
2.0                           4000 0000             4000 0000 0000 0000
ln 10 (2.302...)              4013 5D8E             4002 6BB1 BBB5 5516
e (2.718...)                  402D F854             4005 BF0A 8B14 5769
3.0                           4040 0000             4008 0000 0000 0000
π (3.141...)                  4049 0FDB             4009 21FB 5444 2D18
√10 (3.162...)                404A 62C2             4009 4C58 3ADA 5B53
log2 10 (3.321...)            4054 9A78             400A 934F 0979 A371
4.0                           4080 0000             4010 0000 0000 0000
5.0                           40A0 0000             4014 0000 0000 0000
6.0                           40C0 0000             4018 0000 0000 0000
2π (6.283...)                 40C9 0FDB             4019 21FB 5444 2D18
7.0                           40E0 0000             401C 0000 0000 0000
8.0                           4100 0000             4020 0000 0000 0000
9.0                           4110 0000             4022 0000 0000 0000
10.0                          4120 0000             4024 0000 0000 0000
11.0                          4130 0000             4026 0000 0000 0000
12.0                          4140 0000             4028 0000 0000 0000
13.0                          4150 0000             402A 0000 0000 0000
14.0                          4160 0000             402C 0000 0000 0000
15.0                          4170 0000             402E 0000 0000 0000
16.0                          4180 0000             4030 0000 0000 0000
180/π (57.295...)             4265 2EE1             404C A5DC 1A63 C1F8
$2^{23} - 1$                  4AFF FFFE             415F FFFF C000 0000
$2^{23}$                      4B00 0000             4160 0000 0000 0000
$2^{24} - 1$                  4B7F FFFF             416F FFFF E000 0000
$2^{24}$                      4B80 0000             4170 0000 0000 0000
$2^{31} - 1$                  4F00 0000             41DF FFFF FFC0 0000
$2^{31}$                      4F00 0000             41E0 0000 0000 0000
$2^{32} - 1$                  4F80 0000             41EF FFFF FFE0 0000
$2^{32}$                      4F80 0000             41F0 0000 0000 0000
$2^{52}$                      5980 0000             4330 0000 0000 0000
$2^{63}$                      5F00 0000             43E0 0000 0000 0000
$2^{64}$                      5F80 0000             43F0 0000 0000 0000
Largest normal                7F7F FFFF             7FEF FFFF FFFF FFFF
∞                             7F80 0000             7FF0 0000 0000 0000
“Smallest” SNaN               7F80 0001             7FF0 0000 0000 0001
“Largest” SNaN                7FBF FFFF             7FF7 FFFF FFFF FFFF
“Smallest” QNaN               7FC0 0000             7FF8 0000 0000 0000
“Largest” QNaN                7FFF FFFF             7FFF FFFF FFFF FFFF


IEEE 754 does not specify how the signaling and quiet NaNs 
are distinguished. Table 17-5 uses the convention employed by 
PowerPC, the AMD 29050, the Intel x86 and i860, the SPARC, 
and the ARM family: The most significant fraction bit is 0 for 
signaling and 1 for quiet NaN's. A few machines, mostly older 
ones, use the opposite convention (0 = quiet, 1 = signaling). 


Exercises 


1. What numbers have the same representation, apart from 
trailing 0's, in both single- and double-precision? 


2. Is there a program similar to the approximate reciprocal 
square root routine for computing the approximate 
square root? 


3. Is there a similar program for the approximate cube root 
of a nonnegative normal number? 


4. Is there a similar program for the reciprocal square root of 
a double-precision floating-point number? Assume it is 
for a 64-bit machine, or at any rate that the "long long" 
(64-bit integer) data type is available. 


Chapter 18. Formulas For Primes 


18-1 Introduction 


Like many young students, I once became fascinated with prime 
numbers and tried to find a formula for them. I didn't know 
exactly what operations would be considered valid in a 
"formula," or exactly what function I was looking for—a formula 
for the nth prime in terms of n, or in terms of the previous 
prime(s), or a formula that produces primes but not all of them, 
or something else. Nevertheless, in spite of these ambiguities, I 
would like to discuss a little of what is known about this 
problem. We will see that (a) there are formulas for primes, and 
(b) none of them are very satisfying. 


Much of this subject relates to the present work in that it 
deals with formulas similar to those of some of our programming 
tricks, albeit in the domain of real number arithmetic rather 
than “computer arithmetic." Let us first review a few highlights 
from the history of this subject. 


In 1640, Fermat conjectured that the formula 

$$F_n = 2^{2^n} + 1$$

always produces a prime, and numbers of this form have come 
to be called "Fermat numbers." It is true that $F_n$ is prime for $n$ 
ranging from 0 to 4, but Euler found in 1732 that 

$$F_5 = 2^{2^5} + 1 = 641 \cdot 6700417.$$

(We have seen these factors before in connection with dividing 
by a constant on a 32-bit machine.) Then, F. Landry showed in 
1880 that 

$$F_6 = 2^{2^6} + 1 = 274177 \cdot 67280421310721.$$

$F_n$ is now known to be composite for many larger values of $n$, 
such as all $n$ from 7 to 16 inclusive. For no value of $n > 4$ is it 
known to be prime [H&W]. So much for rash conjectures.1 
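Euler's factorization of $F_5$ is easy to confirm with 64-bit integer arithmetic. The following is only an illustrative sketch; the function names `fermat` and `f5_factors_ok` are ours and do not come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* F_n = 2^(2^n) + 1.  The value fits in 64 bits only for n <= 5. */
uint64_t fermat(unsigned n) {
    assert(n <= 5);
    return ((uint64_t)1 << (1u << n)) + 1;
}

/* Returns 1 if Euler's two factors multiply back to F_5. */
int f5_factors_ok(void) {
    return (uint64_t)641 * 6700417u == fermat(5);
}
```

Landry's factorization of $F_6$ cannot be checked the same way, because $F_6 = 2^{64} + 1$ does not fit in a 64-bit word.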


Incidentally, why would Fermat be led to the double 
exponential? He knew that if $m$ has an odd factor other than 1, 
then $2^m + 1$ is composite. For if $m = ab$ with $b$ odd and not 
equal to 1, then 

$$2^{ab} + 1 = (2^a + 1)(2^{a(b-1)} - 2^{a(b-2)} + 2^{a(b-3)} - \cdots + 1).$$

Knowing this, he must have wondered about $2^m + 1$ with $m$ not 
containing any odd factors (other than 1); that is, $m = 2^n$. He 
tried a few values of $n$ and found that $2^{2^n} + 1$ seemed to be 
prime. 


Certainly everyone would agree that a polynomial qualifies as 
a "formula." One rather amazing polynomial was discovered by 
Leonhard Euler in 1772. He found that 

$$f(n) = n^2 + n + 41$$

is prime-valued for every $n$ from 0 to 39. His result can be 
extended. Because 

$$f(-n) = n^2 - n + 41 = f(n - 1),$$

$f(-n)$ is prime-valued for every $n$ from 1 to 40; that is, $f(n)$ is 
prime-valued for every $n$ from $-1$ to $-40$. Therefore, 

$$f(n - 40) = (n - 40)^2 + (n - 40) + 41 = n^2 - 79n + 1601$$

is prime-valued for every $n$ from 0 to 79. (However, it is lacking 
in aesthetic appeal because it is nonmonotonic and it repeats; 
that is, for $n$ = 0, 1, ..., 79, $n^2 - 79n + 1601$ = 1601, 1523, 
1447, ..., 43, 41, 41, 43, ..., 1447, 1523, 1601.) 
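The claim about $n^2 - 79n + 1601$ can be checked by brute force. The sketch below uses simple trial division, which is adequate for values this small; the function names are hypothetical.

```c
#include <assert.h>

/* Trial-division primality test; adequate for the small values used here. */
int is_prime(long v) {
    if (v < 2) return 0;
    for (long d = 2; d * d <= v; d++)
        if (v % d == 0) return 0;
    return 1;
}

/* Counts the n in [lo, hi] for which n^2 - 79n + 1601 is prime. */
int count_prime_values(long lo, long hi) {
    assert(lo <= hi);
    int count = 0;
    for (long n = lo; n <= hi; n++)
        count += is_prime(n * n - 79 * n + 1601);
    return count;
}
```

Calling `count_prime_values(0, 79)` should report that all 80 values are prime, whereas the value at $n = 80$ is $1681 = 41^2$.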


In spite of this success, it is now known that there is no 
polynomial $f(n)$ that produces a prime for every $n$ (aside from 
constant polynomials such as $f(n) = 5$). In fact, any nontrivial 
"polynomial in exponentials" is composite infinitely often. More 
precisely, as stated in [H&W], 

THEOREM. If $f(n) = P(n, 2^n, 3^n, \ldots, k^n)$ is a polynomial in its 
arguments, with integral coefficients, and $f(n) \to \infty$ when $n \to \infty$, 
then $f(n)$ is composite for an infinity of values of $n$. 

Thus, a formula such as $n^2 \cdot 2^n + 2n^3 + 2n + 5$ must produce 
an infinite number of composites. On the other hand, the 
theorem says nothing about formulas containing terms such as 
$2^{2^n}$, $n^n$, and $n!$. 

A formula for the nth prime, in terms of $n$, can be obtained by 
using the floor function and a magic number 

$$a = 0.203005000700011000013\ldots.$$

The number $a$ is, in decimal, the first prime written in the first 
place after the decimal point, the second prime written in the 
next two places, the third prime written in the next three places, 
and so on. There is always room for the nth prime, because $p_n < 
10^n$. We will not prove this, except to point out that it is known 
that there is always a prime between $n$ and $2n$ (for $n \ge 2$), and 
hence certainly at least one between $n$ and $10n$, from which it 
follows that $p_n < 10^n$. The formula for the nth prime is 

$$p_n = \left\lfloor 10^{\frac{n^2+n}{2}} a \right\rfloor - 10^n \left\lfloor 10^{\frac{n^2-n}{2}} a \right\rfloor,$$

where we have used the relation $1 + 2 + 3 + \cdots + n = (n^2 + 
n)/2$. For example, 

$$p_3 = \lfloor 10^6 a \rfloor - 10^3 \lfloor 10^3 a \rfloor = 203005 - 203000 = 5.$$

This is a pretty cheap trick, as it requires knowledge of the 
result to define $a$. The formula would be interesting if there were 
some way to define $a$ independent of the primes, but no one 
knows of such a definition. 
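To see the mechanics of the formula, the two floor terms can be held as exact integers rather than computed from a floating-point $a$. The sketch below does this for the first five primes; the table `first_primes` and the function names are ours, and the table is exactly the "knowledge of the result" that makes the trick cheap.

```c
#include <assert.h>
#include <stdint.h>

static const unsigned first_primes[] = {2, 3, 5, 7, 11};

/* floor(10^(k(k+1)/2) * a): the first 1 + 2 + ... + k decimal digits of a,
   built by appending the jth prime in a field of j digits. */
uint64_t scaled_a(unsigned k) {
    assert(k <= 5);                  /* 15 digits fit easily in 64 bits */
    uint64_t v = 0, pow10 = 1;
    for (unsigned j = 1; j <= k; j++) {
        pow10 *= 10;                 /* pow10 = 10^j */
        v = v * pow10 + first_primes[j - 1];
    }
    return v;
}

/* p_n = floor(10^((n^2+n)/2) a) - 10^n floor(10^((n^2-n)/2) a). */
unsigned nth_prime_from_a(unsigned n) {
    assert(n >= 1 && n <= 5);
    uint64_t pow10 = 1;
    for (unsigned j = 0; j < n; j++) pow10 *= 10;   /* 10^n */
    return (unsigned)(scaled_a(n) - pow10 * scaled_a(n - 1));
}
```

For $n = 3$ this reproduces the worked example: `scaled_a(3)` is 203005, `scaled_a(2)` is 203, and the difference 203005 − 1000 · 203 is 5.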


Obviously, this technique can be used to obtain a formula for 
many sequences, but it begs the question. 


18-2 Willans's Formulas 


C. P. Willans gives the following formula for the nth prime 
[Will]: 

$$p_n = 1 + \sum_{m=1}^{2^n} \left\lfloor \sqrt[n]{\frac{n}{\displaystyle\sum_{x=1}^{m} \left\lfloor \cos^2 \pi \frac{(x-1)! + 1}{x} \right\rfloor}} \right\rfloor.$$

The derivation starts from Wilson's theorem, which states that $p$ 
is prime or 1 if and only if $(p - 1)! \equiv -1 \pmod{p}$. Thus, 

$$\frac{(x-1)! + 1}{x}$$

is an integer for $x$ prime or $x = 1$ and is fractional for all 
composite $x$. Hence, 

$$F(x) = \left\lfloor \cos^2 \pi \frac{(x-1)! + 1}{x} \right\rfloor = \begin{cases} 1, & x \text{ prime or } 1, \\ 0, & x \text{ composite.} \end{cases} \qquad (1)$$

Thus, if $\pi(m)$ denotes2 the number of primes $\le m$, 

$$\pi(m) = -1 + \sum_{x=1}^{m} F(x). \qquad (2)$$

Observe that $\pi(p_n) = n$, and furthermore, 

$$\pi(m) < n, \text{ for } m < p_n, \text{ and}$$
$$\pi(m) \ge n, \text{ for } m \ge p_n.$$

Therefore, the number of values of $m$ from 1 to $\infty$ for which 
$\pi(m) < n$ is $p_n - 1$. That is, 

$$p_n = 1 + \sum_{m=1}^{\infty} (\pi(m) < n), \qquad (3)$$

where the summand is a "predicate expression" (0/1-valued). 


Because we have a formula for $\pi(m)$, Equation (3) constitutes 
a formula for the nth prime as a function of $n$. But it has two 
features that might be considered unacceptable: an infinite 
summation and the use of a "predicate expression," which is not 
in standard mathematical usage. 


It has been proved that for $n \ge 1$ there is at least one prime 
between $n$ and $2n$. Therefore, the number of primes $\le 2^n$ is at 
least $n$; that is, $\pi(2^n) \ge n$. Thus, the predicate $\pi(m) < n$ is 0 
for $m \ge 2^n$, so the upper limit of the summation above can be 
replaced with $2^n$. 


Willans has a rather clever substitute for the predicate 
expression. Let 

$$LT(x, y) = \left\lfloor \sqrt[y]{\frac{y}{1+x}} \right\rfloor, \text{ for } x = 0, 1, 2, \ldots;\ y = 1, 2, \ldots.$$

Then, if $x < y$, $1 \le y/(1 + x) \le y$, so $1 \le \sqrt[y]{y/(1 + x)} \le \sqrt[y]{y} < 2$. 
Furthermore, if $x \ge y$, then $0 < y/(1 + x) < 1$, so 
$0 \le \sqrt[y]{y/(1 + x)} < 1$. Applying the floor function, we have 

$$LT(x, y) = \begin{cases} 1, & \text{for } x < y, \\ 0, & \text{for } x \ge y. \end{cases}$$

That is, $LT(x, y)$ is the predicate $x < y$ (for $x$ and $y$ in the given 
ranges). 

Substituting, Equation (3) can be written 

$$p_n = 1 + \sum_{m=1}^{2^n} LT(\pi(m), n) = 1 + \sum_{m=1}^{2^n} \left\lfloor \sqrt[n]{\frac{n}{1 + \pi(m)}} \right\rfloor.$$

Further substituting Equation (2) for $\pi(m)$ in terms of $F(x)$, and 
Equation (1) for $F(x)$, gives the formula shown at the beginning 
of this section. 
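The structure of this derivation can be followed in a few lines of C. The sketch below is not Willans's formula as written: it evaluates $F(x)$ by testing Wilson's congruence in modular arithmetic instead of taking $\lfloor \cos^2 \rfloor$ of a huge factorial, and it uses the predicate $\pi(m) < n$ of Equation (3) directly in place of the nth-root expression. The function names and the bound $n \le 12$ are our own choices.

```c
#include <assert.h>
#include <stdint.h>

/* F(x) of Equation (1): 1 if x is prime or 1, else 0.  By Wilson's theorem
   this is the test ((x-1)! + 1) mod x == 0; the factorial is reduced mod x
   at every step so that it cannot overflow. */
int F(uint32_t x) {
    assert(x >= 1);
    uint64_t fact = 1 % x;
    for (uint32_t i = 2; i < x; i++)
        fact = fact * i % x;
    return (fact + 1) % x == 0;
}

/* p_n by Equation (3), with the upper limit of the sum replaced by 2^n. */
uint32_t willans_prime(unsigned n) {
    assert(n >= 1 && n <= 12);
    uint32_t p = 1;
    int sum_F = 0;                          /* F(1) + ... + F(m) */
    for (uint32_t m = 1; m <= (1u << n); m++) {
        sum_F += F(m);
        int pi_m = sum_F - 1;               /* Equation (2) */
        p += (pi_m < (int)n);               /* predicate pi(m) < n */
    }
    return p;
}
```

The running time grows like $4^n$, which is a fair reflection of how impractical the formula is as a way of actually finding primes.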


Second Formula 


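Willans then gives another formula: 

$$p_n = \sum_{m=1}^{2^n} m F(m) \left\lfloor 2^{-|\pi(m) - n|} \right\rfloor.$$

Here, $F$ and $\pi$ are the functions used in his first formula. Thus, 
$mF(m) = m$ if $m$ is prime or 1, and 0 otherwise. The third factor 
in the summand is the predicate $\pi(m) = n$. The summand is 0 
except for one term, which is the nth prime. For example, 

$$\begin{aligned} p_4 &= 1\cdot1\cdot0 + 2\cdot1\cdot0 + 3\cdot1\cdot0 + 4\cdot0\cdot0 + 5\cdot1\cdot0 + 6\cdot0\cdot0 + 7\cdot1\cdot1 \\ &\quad + 8\cdot0\cdot1 + 9\cdot0\cdot1 + 10\cdot0\cdot1 + 11\cdot1\cdot0 + \cdots + 16\cdot0\cdot0 \\ &= 7. \end{aligned}$$
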
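A sketch of the second formula follows, under the same assumptions as before: the prime-or-1 test is done with Wilson's congruence in modular arithmetic, and $\lfloor 2^{-d} \rfloor$ for an integer $d \ge 0$ is computed as a right shift of 1. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* 1 if x is prime or 1, else 0 (Wilson's theorem, factorial taken mod x). */
static int prime_or_1(uint32_t x) {
    uint64_t fact = 1 % x;
    for (uint32_t i = 2; i < x; i++)
        fact = fact * i % x;
    return (fact + 1) % x == 0;
}

/* p_n as the sum over m of m * F(m) * floor(2^(-|pi(m) - n|)). */
uint32_t willans_prime2(unsigned n) {
    assert(n >= 1 && n <= 12);
    uint32_t p = 0;
    int pi_m = -1;                          /* pi(m) = -1 + F(1) + ... + F(m) */
    for (uint32_t m = 1; m <= (1u << n); m++) {
        int f = prime_or_1(m);
        pi_m += f;
        int d = abs(pi_m - (int)n);
        uint32_t third = d < 32 ? (1u >> d) : 0;   /* floor(2^(-d)) */
        p += m * (uint32_t)f * third;
    }
    return p;
}
```

For $n = 4$ the only nonzero term is the one for $m = 7$, as in the example above.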
Third Formula 


Willans goes on to present another formula for the nth prime 
that does not use any "nonanalytic"3 functions such as floor and 
absolute value. He starts by noting that for $x = 2, 3, \ldots$, the 
function 

$$\frac{((x-1)!)^2}{x} = \begin{cases} \text{an integer} + \dfrac{1}{x}, & \text{when } x \text{ is prime}, \\ \text{an integer}, & \text{when } x \text{ is composite or } 1. \end{cases}$$

The first part follows from 

$$\frac{((x-1)!)^2}{x} = \frac{((x-1)! + 1)((x-1)! - 1)}{x} + \frac{1}{x}$$

and $x$ divides $(x - 1)! + 1$, by Wilson's theorem. Thus, the 
predicate "$x$ is prime," for $x \ge 2$, is given by 

$$H(x) = \frac{\sin^2 \pi \dfrac{((x-1)!)^2}{x}}{\sin^2 \dfrac{\pi}{x}}.$$

From this it follows that 

$$\pi(m) = \sum_{x=2}^{m} H(x), \text{ for } m = 2, 3, \ldots.$$


This cannot be converted to a formula for $p_n$ by the methods 
used in the first two formulas, because they use the floor 
function. Instead, Willans suggests the following formula4 for 
the predicate $x < y$, for $x, y \ge 1$: 

$$LT(x, y) = \sin\left(\frac{\pi}{2} \cdot 2^e\right), \text{ where } e = \prod_{i=0}^{y-1} (x - i).$$

Thus, if $x < y$, $e = x(x - 1)\cdots(0)(-1)\cdots(x - (y - 1)) = 0$, so that 
$LT(x, y) = \sin(\pi/2) = 1$. If $x \ge y$, the product does not include 
0, so $e \ge 1$, so that $LT(x, y) = \sin((\pi/2) \cdot (\text{an even number})) = 
0$. 


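Finally, as in the first of Willans's formulas, 

$$p_n = 2 + \sum_{m=2}^{2^n} LT(\pi(m), n).$$

Written out in full, this is the rather formidable 

$$p_n = 2 + \sum_{m=2}^{2^n} \sin\left(\frac{\pi}{2} \cdot 2^{\,\prod_{i=0}^{n-1}\left(\sum_{x=2}^{m} \frac{\sin^2 \pi \frac{((x-1)!)^2}{x}}{\sin^2 \frac{\pi}{x}} \;-\; i\right)}\right).$$


Fourth Formula 


Willans then gives a formula for $p_{n+1}$ in terms of $p_n$: 

$$p_{n+1} = 1 + p_n + \sum_{i=1}^{p_n} \prod_{j=1}^{i} f(p_n + j),$$

where $f(x)$ is the predicate "$x$ is composite," for $x \ge 2$; that is, 

$$f(x) = \left\lfloor \cos^2 \pi \frac{((x-1)!)^2}{x} \right\rfloor.$$

Alternatively, one could use $f(x) = 1 - H(x)$, to keep the formula 
free of floor functions. 


As an example of this formula, let $p_n = 7$. Then, 

$$\begin{aligned} p_{n+1} &= 1 + 7 + f(8) + f(8)f(9) + f(8)f(9)f(10) \\ &\quad + f(8)f(9)f(10)f(11) + \cdots + f(8)f(9)\cdots f(14) \\ &= 1 + 7 + 1 + 1\cdot1 + 1\cdot1\cdot1 + 1\cdot1\cdot1\cdot0 + \cdots + 1\cdot1\cdot1\cdot0\cdot1\cdot0\cdot1 \\ &= 11. \end{aligned}$$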
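The sum of products in the fourth formula simply counts the run of composites that follows $p_n$. A minimal sketch, assuming the predicate "$x$ is composite" is evaluated by trial division rather than by the $\cos^2$ expression (the function names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* The predicate "x is composite" for x >= 2, by trial division. */
int is_composite(uint32_t x) {
    assert(x >= 2);
    for (uint32_t d = 2; d * d <= x; d++)
        if (x % d == 0) return 1;
    return 0;
}

/* p_{n+1} = 1 + p_n + sum_{i=1}^{p_n} prod_{j=1}^{i} f(p_n + j). */
uint32_t next_prime(uint32_t p) {
    assert(p >= 2 && p <= 10000);
    uint32_t next = 1 + p;
    for (uint32_t i = 1; i <= p; i++) {
        uint32_t prod = 1;
        for (uint32_t j = 1; j <= i; j++)
            prod *= (uint32_t)is_composite(p + j);
        next += prod;               /* 1 while p+1, ..., p+i are all composite */
    }
    return next;
}
```

With $p_n = 7$ the products are 1, 1, 1, 0, 0, 0, 0, giving 1 + 7 + 3 = 11, in agreement with the example.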




18-3 Wormell's Formula 


C. P. Wormell [Wor] improves on Willans's formulas by avoiding 
both trigonometric functions and the floor function. Wormell's 
formula can, in principle, be evaluated by a simple computer 
program that uses only integer arithmetic. The derivation does 
not use Wilson's theorem. Wormell starts with, for $x \ge 2$, 

$$B(x) = \prod_{a=2}^{x} \prod_{b=2}^{x} (x - ab)^2 = \begin{cases} \text{a positive integer}, & \text{if } x \text{ is prime}, \\ 0, & \text{if } x \text{ is composite.} \end{cases}$$

Thus, the number of primes $\le m$ is given by 

$$\pi(m) = \sum_{x=2}^{m} \frac{1 + (-1)^{2^{B(x)}}}{2},$$

because the summand is the predicate "$x$ is prime." 

Observe that, for $n \ge 1$, $a \ge 0$, 

$$\prod_{r=1}^{n} (1 - r + a) = \begin{cases} 0, & \text{when } a < n, \\ \text{a positive integer}, & \text{when } a \ge n. \end{cases}$$

Repeating a trick above, the predicate $a < n$ is 

$$(a < n) = \frac{1 - (-1)^{2^{\prod_{r=1}^{n} (1 - r + a)}}}{2}.$$

Because 

$$p_n = 2 + \sum_{m=2}^{2^n} (\pi(m) < n),$$

we have, upon factoring constants out of summations, 

$$p_n = \frac{3}{2} + 2^{n-1} - \frac{1}{2} \sum_{m=2}^{2^n} (-1)^{2^{\,\prod_{r=1}^{n}\left(1 - r + \frac{m-1}{2} + \frac{1}{2}\sum_{x=2}^{m} (-1)^{2^{\prod_{a=2}^{x}\prod_{b=2}^{x}(x-ab)^2}}\right)}}.$$

As promised, Wormell's formula does not use trigonometric 
functions. However, as he points out, if the powers of $-1$ were 
expanded using $(-1)^n = \cos \pi n$, they would reappear. 
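The following sketch evaluates Wormell's formula with integer arithmetic only. It takes one liberty that should be stated plainly: the quantities $2^{B(x)}$ are far too large to compute, so the code uses the fact that $(-1)^{2^e}$ is $-1$ when $e = 0$ and $+1$ when $e > 0$, and it only tracks whether each product is zero. The function names and the bound $n \le 8$ are our own.

```c
#include <assert.h>
#include <stdint.h>

/* (-1)^(2^B(x)), where B(x) is the double product of (x - ab)^2.
   B(x) = 0 exactly when some factor x - ab is zero, giving (-1)^1 = -1. */
static int sign_B(uint32_t x) {
    for (uint32_t a = 2; a <= x; a++)
        for (uint32_t b = 2; b <= x; b++)
            if (a * b == x) return -1;
    return 1;
}

/* p_n = 3/2 + 2^(n-1) - (1/2) * sum over m of (-1)^(2^product), computed
   as 2*p_n to stay in integers. */
uint32_t wormell_prime(unsigned n) {
    assert(n >= 1 && n <= 8);
    int32_t twice_p = 3 + (1 << n);
    int32_t s = 0;                       /* sum_{x=2}^{m} (-1)^(2^B(x)) */
    for (int32_t m = 2; m <= (1 << n); m++) {
        s += sign_B((uint32_t)m);
        int32_t pi_m = ((m - 1) + s) / 2;        /* always an exact division */
        int zero = 0;                    /* is prod_{r=1}^{n} (1 - r + pi_m) zero? */
        for (int32_t r = 1; r <= (int32_t)n; r++)
            if (1 - r + pi_m == 0) zero = 1;
        twice_p -= zero ? -1 : 1;
    }
    return (uint32_t)(twice_p / 2);
}
```

For $n = 2$ the three terms of the sum are $-1$, $+1$, $+1$, so $2p_2 = 7 - 1 = 6$ and $p_2 = 3$.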


18-4 Formulas for Other Difficult Functions 


Let us have a closer look at what Willans and Wormell have 
done. We postulate the rules below as defining what we mean by 
the class of functions that can be represented by “formulas,” 
which we will call "formula functions." Here, $\bar{x}$ is shorthand for 
$x_1, x_2, \ldots, x_n$ for any $n \ge 1$. The domain of values is the integers 
$\ldots, -2, -1, 0, 1, 2, \ldots$. 


1. The constants ..., −1, 0, 1, ... are formula functions. 


2. The projection functions $f(\bar{x}) = x_i$, for $1 \le i \le n$, are 
formula functions. 

3. The expressions $x + y$, $x - y$, and $xy$ are formula functions, if 
$x$ and $y$ are. 


4. The class of formula functions is closed under composition 
(substitution). That is, $f(g_1(\bar{x}), g_2(\bar{x}), \ldots, g_m(\bar{x}))$ is a 
formula function if $f$ and $g_i$ are, for $i = 1, \ldots, m$. 


5. Bounded sums and products, written 

$$\sideset{}{'}\sum_{i=a(\bar{x})}^{b(\bar{x})} f(i, \bar{x}) \quad \text{and} \quad \sideset{}{'}\prod_{i=a(\bar{x})}^{b(\bar{x})} f(i, \bar{x}),$$

are formula functions, if $a$, $b$, and $f$ are, and $a(\bar{x}) \le b(\bar{x})$. 

Sums and products are required to be bounded to preserve 
the computational character of formulas; that is, formulas can be 
evaluated by plugging in values for the arguments and carrying 
out a finite number of calculations. The reason for the prime on 
the $\Sigma$ and $\Pi$ is explained later in this chapter. 


When forming new formula functions using composition, we 
supply parentheses when necessary according to well-established 
conventions. 


Notice that division is not included in the list above; that's 
too complicated to be uncritically accepted as a "formula 
function." Even so, the above list is not minimal. It might be fun 
to find a minimal starting point, but we won't dwell on that 
here. 


This definition of "formula function" is close to the definition 
of "elementary function" given in [Cut]. However, the domain of 
values used in [Cut] is the nonnegative integers (as is usual in 
recursive function theory). Also, [Cut] requires the bounds on 
the iterated sum and product to be 0 and x - 1 (where x is a 
variable), and allows the range to be vacuous (in which case the 
sum is defined as 0 and the product is defined as 1). 


In what follows, we show that the class of formula functions 
is quite extensive, including most of the functions ordinarily 


encountered in mathematics. But it doesn't include every 
function that is easy to define and has an elementary character. 


Our development is slightly encumbered, compared to similar 
developments in recursive function theory, because here 
variables can take on negative values. The possibility of a value's 
being negative can often be accommodated by simply squaring 
some expression that would otherwise appear in the first power. 
Our insistence that iterated sums and products not be vacuous is 
another slight encumbrance. 


Here, a "predicate" is simply a 0/1-valued function, whereas 
in recursive function theory a predicate is a true/false-valued 
function, and every predicate has an associated "characteristic 
function" that is 0/1-valued. We associate 1 with true and 0 
with false, as is universally done in programming languages and 
in computers (in what their and and or instructions do); in logic 
and recursive function theory, the association is often the 
opposite. 


The following are formula functions: 

1. $a^2 = aa$, $a^3 = aaa$, and so on. 

2. The predicate $a = b$: 

$$(a = b) = \sideset{}{'}\prod_{j=0}^{(a-b)^2} (1 - j).$$

3. $(a \ne b) = 1 - (a = b)$. 

4. The predicate $a \ge b$: 

$$(a \ge b) = \sideset{}{'}\sum_{i=0}^{(a-b)^2} ((a - b) = i) = \sideset{}{'}\sum_{i=0}^{(a-b)^2} \; \sideset{}{'}\prod_{j=0}^{((a-b)-i)^2} (1 - j).$$


We can now explain why we do not use the convention 
that a vacuous iterated sum/product has the value 0/1. If 
we did, we would have such shams as 

$$(a = b) = \prod_{i=0}^{(a-b)^2 - 1} 0 \quad \text{and} \quad (a \ge b) = \prod_{i=a}^{b-1} 0.$$

The comparison predicates are key to everything that 
follows, and we don't wish to have them based on 
anything quite that artificial. 

5. $(a > b) = (a \ge b + 1)$. 

6. $(a \le b) = (b \ge a)$. 

7. $(a < b) = (b > a)$. 

8. $|a| = (2(a \ge 0) - 1)a$. 

9. $\max(a, b) = (a \ge b)(a - b) + b$. 

10. $\min(a, b) = (a \ge b)(b - a) + a$. 

Now we can fix the iterated sums and products so that 
they give the conventional and useful result when the 
range is vacuous. 

11. $$\sum_{i=a(\bar{x})}^{b(\bar{x})} f(i, \bar{x}) = (b(\bar{x}) \ge a(\bar{x})) \sideset{}{'}\sum_{i=a(\bar{x})}^{\max(a(\bar{x}),\, b(\bar{x}))} f(i, \bar{x}).$$

12. $$\prod_{i=a(\bar{x})}^{b(\bar{x})} f(i, \bar{x}) = 1 + (b(\bar{x}) \ge a(\bar{x})) \left(-1 + \sideset{}{'}\prod_{i=a(\bar{x})}^{\max(a(\bar{x}),\, b(\bar{x}))} f(i, \bar{x})\right).$$

From now on we will use $\Sigma$ and $\Pi$ without the prime. All 
functions thus defined are total (defined for all values of 
the arguments). 

13. $$n! = \prod_{i=1}^{n} i.$$

This gives $n! = 1$ for $n \le 0$. 

In what follows, $P$ and $Q$ denote predicates. 

14. $\neg P(\bar{x}) = 1 - P(\bar{x})$. 

15. $P(\bar{x}) \,\&\, Q(\bar{x}) = P(\bar{x})Q(\bar{x})$. 

16. $P(\bar{x}) \mid Q(\bar{x}) = 1 - (1 - P(\bar{x}))(1 - Q(\bar{x}))$. 

17. $P(\bar{x}) \oplus Q(\bar{x}) = (P(\bar{x}) - Q(\bar{x}))^2$. 

18. if $P(\bar{x})$ then $f(\bar{y})$ else $g(\bar{z})$ $= P(\bar{x})f(\bar{y}) + (1 - P(\bar{x}))g(\bar{z})$. 

19. $$a^n = \text{if } n \ge 0 \text{ then } \prod_{i=1}^{n} a \text{ else } 0.$$

This gives, arbitrarily and perhaps incorrectly for a few 
cases, the result 0 for $n < 0$, and the result 1 for $0^0$. 

20. $$(m \le \forall x \le n)P(x, \bar{y}) = \prod_{x=m}^{n} P(x, \bar{y}).$$

21. $$(m \le \exists x \le n)P(x, \bar{y}) = 1 - \prod_{x=m}^{n} (1 - P(x, \bar{y})).$$

$\forall$ is vacuously true; $\exists$ is vacuously false. 

22. $$(m \le \min x \le n)P(x, \bar{y}) = m + \sum_{i=m}^{n} \prod_{j=m}^{i} (1 - P(j, \bar{y})).$$

The value of this expression is the least $x$ in the range $m$ 
to $n$ such that the predicate is true, or $m$ if the range is 
vacuous, or $n + 1$ if the predicate is false throughout the 
(nonvacuous) range. The operation is called "bounded 
minimalization" and it is a very powerful tool for 
developing new formula functions. It is a sort of 
functional inverse, as illustrated by the next formula. 
That minimalization can be done by a sum of products is 
due to Goodstein [Good]. 


23. $\lfloor \sqrt{n} \rfloor = (0 \le \min k \le |n|)((k + 1)^2 > n)$. 
This is the "integer square root" function, which we 
define to be 0 for $n < 0$, just to make it a total function. 

24. $d \mid n = (-|n| \le \exists q \le |n|)(n = qd)$. 
This is the "$d$ divides $n$" predicate, according to which $0 \mid 0$ 
but $\neg(0 \mid n)$ for $n \ne 0$. 

25. $n \div d =$ if $n \ge 0$ then $(-n \le \min q \le n)(0 \le \exists r \le |d| - 1)(n 
= qd + r)$ else $(n \le \min q \le -n)(-|d| + 1 \le \exists r \le 0)(n = 
qd + r)$. 
This is the conventional truncating form of integer 
division. For $d = 0$, it gives a result of $|n| + 1$, 
arbitrarily. 

26. $\mathrm{rem}(n, d) = n - (n \div d)d$. 
This is the conventional remainder function. If $\mathrm{rem}(n, d)$ 
is nonzero, it has the sign of the numerator $n$. If $d = 0$, 
the remainder is $n$. 

27. $\mathrm{isprime}(n) = n \ge 2 \;\&\; \neg(2 \le \exists d \le |n| - 1)(d \mid n)$. 

28. $$\pi(n) = \sum_{i=1}^{n} \mathrm{isprime}(i).$$
(Number of primes $\le n$.) 

29. $p_n = (1 \le \min k \le 2^n)(\pi(k) = n)$. 

30. $\mathrm{exponent}(p, n) = (0 \le \min x \le |n|)\,\neg(p^{x+1} \mid n)$. 
This is the exponent of a given prime factor $p$ of $n$, for $n 
\ge 1$. 

31. For $n \ge 0$: 

$$2^n = \prod_{i=1}^{n} 2, \quad 2^{2^n} = \prod_{i=1}^{2^n} 2, \quad 2^{2^{2^n}} = \prod_{i=1}^{2^{2^n}} 2, \text{ etc.}$$

32. The nth digit after the decimal point in the decimal 
expansion of $\sqrt{2}$: $\mathrm{rem}(\lfloor \sqrt{2 \cdot 10^{2n}} \rfloor, 10)$. 
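Several of the items above translate almost literally into C, which makes it easier to see why they work. The sketch below covers items 2, 4, 22, and 23. The function names are ours, the bounded sums and products become loops, and the predicate passed to `bounded_min` uses C's native comparison for brevity where a pure formula function would use item 5. The running time is absurd (the test in item 4 alone is quartic in $a - b$), so it is meant only for small arguments.

```c
#include <assert.h>

/* Item 2: (a = b) as the nonvacuous product of (1 - j), j = 0..(a-b)^2. */
long eq(long a, long b) {
    long prod = 1, top = (a - b) * (a - b);
    for (long j = 0; j <= top; j++)
        prod *= 1 - j;              /* becomes 0 at j = 1 and stays 0 */
    return prod;
}

/* Item 4: (a >= b) as the sum of ((a - b) = i), i = 0..(a-b)^2. */
long ge(long a, long b) {
    long sum = 0, top = (a - b) * (a - b);
    for (long i = 0; i <= top; i++)
        sum += eq(a - b, i);
    return sum;
}

/* Item 22: m plus a sum of products; the least x in [m, n] with pred true. */
long bounded_min(long m, long n, long (*pred)(long x, long y), long y) {
    assert(pred != 0);
    long result = m;
    for (long i = m; i <= n; i++) {
        long prod = 1;
        for (long j = m; j <= i; j++)
            prod *= 1 - pred(j, y);
        result += prod;             /* 1 while pred is false on m..i */
    }
    return result;
}

/* The predicate of item 23: (k + 1)^2 > n. */
static long next_square_exceeds(long k, long n) {
    return (k + 1) * (k + 1) > n;
}

/* Item 23: integer square root, defined to be 0 for n < 0. */
long isqrt(long n) {
    long abs_n = n < 0 ? -n : n;
    return bounded_min(0, abs_n, next_square_exceeds, n);
}
```

For example, `isqrt(10)` adds one to the result for each of $k = 0, 1, 2$, for which $(k+1)^2 \le 10$, and then every later product contains a zero factor.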
Thus, the class of formula functions is quite large. It is 
limited, though, by the following theorem (at least): 


THEOREM. If $f$ is a formula function, then there is a constant $k$ 
such that 

$$f(\bar{x}) \le \left. 2^{2^{\cdot^{\cdot^{\cdot^{2^{\max(|x_1|, \ldots, |x_n|)}}}}}} \right\} k,$$

where there are $k$ 2's. 


This can be proved by showing that each application of one 
of the rules 1-5 (on page 398) preserves the theorem. For 
example, if $f(\bar{x}) = c$ (rule 1), then for some $h$, 

$$f(\bar{x}) \le \left. 2^{2^{\cdot^{\cdot^{\cdot^{2}}}}} \right\} h,$$

where there are $h$ 2's. Therefore, 

$$f(\bar{x}) \le \left. 2^{2^{\cdot^{\cdot^{\cdot^{2^{\max(|x_1|, \ldots, |x_n|)}}}}}} \right\} h + 2,$$

because $\max(|x_1|, \ldots, |x_n|) \ge 0$. 


For $f(\bar{x}) = x_i$ (rule 2), $f(\bar{x}) \le \max(|x_1|, \ldots, |x_n|)$, so the 
theorem holds with $k = 0$. 

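For rule 3, let 

$$x \le \left. 2^{2^{\cdot^{\cdot^{\cdot^{2^{\max(|x_1|, \ldots, |x_n|)}}}}}} \right\} k_1 \quad \text{and} \quad y \le \left. 2^{2^{\cdot^{\cdot^{\cdot^{2^{\max(|x_1|, \ldots, |x_n|)}}}}}} \right\} k_2.$$

Then, clearly 

$$x \pm y \le 2 \cdot \left. 2^{2^{\cdot^{\cdot^{\cdot^{2^{\max(|x_1|, \ldots, |x_n|)}}}}}} \right\} \max(k_1, k_2) \le \left. 2^{2^{\cdot^{\cdot^{\cdot^{2^{\max(|x_1|, \ldots, |x_n|)}}}}}} \right\} \max(k_1, k_2) + 1.$$

Similarly, it can be shown that the theorem holds for $f(x, y) 
= xy$. 

The proofs that rules 4 and 5 preserve the theorem are a bit 
tedious, but not difficult, and are omitted. 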


From the theorem, it follows that the function 

$$f(x) = \left. 2^{2^{\cdot^{\cdot^{\cdot^{2}}}}} \right\} x \qquad (4)$$

is not a formula function, because for sufficiently large $x$, 
Equation (4) exceeds the value of the same expression with any 
fixed number $k$ of 2's. 


For those interested in recursive function theory, we point 
out that Equation (4) is primitive recursive. Furthermore, it is 
easy to show directly from the definition of primitive recursion 
that formula functions are primitive recursive. Therefore, the 
class of formula functions is a proper subset of the primitive 
recursive functions. The interested reader is referred to [Cut]. 


In summary, this section shows that not only is there a 
formula in elementary functions for the nth prime but also for a 
good many other functions encountered in mathematics. 
Furthermore, our “formula functions” are not based on 
trigonometric functions, the floor function, absolute value, 
powers of -1, or even division. The only sneaky maneuver is to 
use the fact that the product of a lot of numbers is 0 if any one 
of them is 0, which is used in the formula for the predicate a = 
b. 


It is true, however, that once you see them, they are not 
interesting. The quest for "interesting" formulas for primes 
should go on. For example, [Rib] cites the amazing theorem of 
W. H. Mills (1947) that there exists a $\theta$ such that the expression 

$$\lfloor \theta^{3^n} \rfloor$$

is prime-valued for all $n \ge 1$. Actually, there are an infinite 
number of such values (e.g., 1.3063778838+ and 
1.4537508625483+). Furthermore, there is nothing special 
about the "3"; the theorem is true if the 3 is replaced with any 
real number > 2.106 (for different values of $\theta$). Better yet, the 3 
can be replaced with 2 if it is true that there is always a prime 
between $n^2$ and $(n + 1)^2$, which is almost certainly true, but has 
never been proved. And furthermore, ... well, the interested 
reader is referred to [Rib] and to [Dud] for more fascinating 
formulas of this type. 


Exercises 


1. Prove that for any non-constant polynomial $f(x)$ with 
integral coefficients, $|f(x)|$ is composite for an infinite 
number of values of $x$. 

Hint: If $f(x_0) = k$, consider $f(x_0 + rk)$, where $r$ is an 
integer greater than 1. 


2. Prove Wilson's theorem: An integer $p > 1$ is prime if and 
only if 

$$(p - 1)! \equiv -1 \pmod{p}.$$

Hint: To show that if $p$ is prime, then $(p - 1)! \equiv -1 \pmod{p}$, 
group the terms of the factorial in pairs $(a, b)$ such 
that $ab \equiv 1 \pmod{p}$. Use Theorem MI of Section 10-16 
on page 240. 


3. Show that if $n$ is a composite integer greater than 4, then 

$$(n - 1)! \equiv 0 \pmod{n}.$$


4. Calculate an estimate of the value of $\theta$ that satisfies Mills's 
theorem, and in the process give an informal proof of the 
theorem. Assume that for $n > 1$ there exists a prime 
between $n^3$ and $(n + 1)^3$. (This depends upon the 
Riemann Hypothesis, although it has been proved 
independent of RH for sufficiently large $n$.) 


5. Consider the set of numbers of the form $a + b\sqrt{-5}$, where 
$a$ and $b$ are integers. Show that 2 and 3 are primes in this 
set; that is, they cannot be decomposed into factors in the 
set unless one of the factors is ±1 (a "unit"). Find a 
number in the set that has two distinct decompositions 
into products of primes. (The "fundamental theorem of 
arithmetic" states that prime decomposition is unique 
except for units and the order of the factors. Uniqueness 
does not hold for this set of numbers with multiplication 
and addition being that of complex numbers. It is an 
example of a "ring.") 


Answers To Exercises 


Chapter 1: Introduction 


1. The following is pretty accurate: 


      e1;
      while (e2) {
         statement
         e3;
      }

If e2 is not present in the for loop, the constant 1 is used 
for it in the above expansion (which would then be a 
nonterminating loop, unless something in statement 
terminates it). 

Expressing a for loop in terms of a do loop is 
somewhat awkward, because the body of a do loop is 
always executed at least once, whereas the body of a for 
loop may not be executed at all, depending on e1 and e2. 
Nevertheless, the for loop can be expressed as follows. 


      e1;
      if (e2) {
         do {statement; e3;} while (e2);
      }


Again, if e2 is not present in the for loop, then use 1 for 
it above. 

2. If your code is 

      for (i = 0; i <= 0xFFFFFFFF; i++) {...}

then you have an infinite loop. A loop that works is 


      i = 0xFFFFFFFF;
      do {i = i + 1; ...} while (i < 0xFFFFFFFF);
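The same loop shape can be tried out quickly on a narrower type. The sketch below uses an 8-bit unsigned counter instead of the 32-bit one in the answer, purely so that it finishes at once; the function name is ours.

```c
#include <assert.h>
#include <stdint.h>

/* Visits every value of an 8-bit unsigned integer exactly once, in the
   order 0, 1, ..., 0xFF, and returns the number of iterations. */
unsigned visit_all_u8(void) {
    unsigned count = 0;
    uint8_t i = 0xFF;
    do {
        i = (uint8_t)(i + 1);       /* wraps from 0xFF to 0 on the first pass */
        count++;                    /* the loop body would use i here */
    } while (i < 0xFF);
    assert(count == 256);
    return count;
}
```

The corresponding `for (i = 0; i <= 0xFF; i++)` with an 8-bit `i` never terminates, because the test `i <= 0xFF` can never be false.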


3. The text mentions multiply, which for 32 × 32 ⇒ 64-bit 
multiplication needs two output registers. 


It also mentions divide. The usual implementation of 
this instruction produces a remainder as well as the 
quotient, and execution time would be saved in many 
programs if both results were available. 


Actually, the most natural machine division operation 
takes a doubleword dividend, a single word divisor, and 
produces a quotient and remainder. This uses three source 
registers and two targets. 


Indexed store instructions use three source registers: 
the register being stored, the base register, and the index 
register. 


To efficiently deal with bit fields in a register, many 
machines provide extract and insert instructions. The 
general form of extract needs three sources and one target. 
The source registers are the register that contains the field 
being extracted, a starting bit number, and an ending bit 
number or length. The result is right justified and either 
zero- or sign-extended and placed in the target register. 
Some machines provide this instruction only in the form 
in which the field length is an immediate quantity, which 
is a reasonable compromise because that is the common 
case. 


The general insert instruction reads four source 
registers and writes one target register. As commonly 
implemented, the sources are a register that contains the 
source bits to be inserted in the target (these come from 
the low-order end of the source register), the starting bit 
position in the target, and the length. In addition to 
reading these three registers, the instruction must read the 
target register, combine it with the bits to be inserted, and 
write the result to the target register. As in the case of 
extract, the field length may be an immediate quantity, in 
which case the instruction does three register reads and 
one write. 


Some machines provide a family of select instructions: 


      SELcc RT,RA,RB,RC


Register RC is tested, and if it satisfies the condition 
specified in the opcode (shown as cc, which may be EQ, 
GT, GE, etc.), then RA is selected; otherwise, RB is 
selected. The selected register is copied to the target. 


Although not common, a plausible instruction is bit 
select, or multiplex: 


      MUX RT,RA,RB,RC


Here RC contains a mask. Wherever the mask is 1, the 
corresponding bit of RA is selected, and wherever it is 0, 
the corresponding bit of RB is selected. That is, it 
performs the operation 


      RT <-- RA & RC | RB & ~RC


Shift right/left double: A sometimes useful instruction 
is 


      SHLD RT,RA,RB,RC


This concatenates RA and RB, treating them as a double- 
length register, and shifts them left (or right) by an 
amount given by RC. RT gets the part of the result that 
has bits from RA and RB. These instructions are useful in 
"bignum" arithmetic and in more mundane situations. 


In signal processing and other applications, it is 
helpful to have an instruction that computes A*B + C. 
This applies to both integer and floating-point data. 


Of course, there are load multiple and store multiple, 
which require many register reads or writes. Although 
many RISCs have them, they are not usually considered to 
be RISC instructions. 
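Two of the instructions described in this answer are easy to express in C, which may help pin down their semantics. The sketch below assumes 32-bit registers and, for the double shift, a shift amount from 0 to 31 with RT receiving the high-order word; those details and the function names are our assumptions, not a description of any particular machine.

```c
#include <assert.h>
#include <stdint.h>

/* Bit select (multiplex): bits of ra where mask is 1, bits of rb elsewhere. */
uint32_t mux(uint32_t ra, uint32_t rb, uint32_t mask) {
    return (ra & mask) | (rb & ~mask);
}

/* Shift left double: the high-order 32 bits of (ra:rb) << n. */
uint32_t shld(uint32_t ra, uint32_t rb, unsigned n) {
    assert(n <= 31);
    uint64_t pair = ((uint64_t)ra << 32) | rb;
    return (uint32_t)((pair << n) >> 32);
}
```

For instance, shifting the pair 0x12345678:0x9ABCDEF0 left by 8 brings the top byte of RB into the low byte of the result, giving 0x3456789A.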


Chapter 2: Basics 


1. (Derivation by David de Kloet) Clearly the body of the 
while-loop is executed a number of times equal to the 
number of trailing 0's in $x$. The $k$ 1-bits partition the $n$-bit 
word into $k + 1$ segments, each containing 0 or more 0-bits. 
The number of 0's in each word is $n - k$. If $N$ is the 
number of words ($N = \binom{n}{k}$, but that need not concern us 
here), then the total number of 0's in all the words is $N(n 
- k)$. By symmetry, the number of 0's in any segment, 
summed over all $N$ words, is the same, and is therefore 
equal to $N(n - k)/(k + 1)$. Thus, the average number of 
0's in any segment is $(n - k)/(k + 1)$, and this applies to the 
last segment, which is the number of trailing 0's. 


As an example, if $n = 32$ and $k = 3$, then the while-loop 
is executed 7.25 times, on average. On many 
machines the while-loop can be implemented in as few as 
three instructions (and, shift right, and conditional branch), 
which might take as few as four cycles. With these 
parameters, the while-loop takes 4 · 7.25 = 29 cycles on 
average. This is less than the divide time on most 32-bit 
machines, resulting in de Kloet's algorithm being faster 
than Gosper's. For larger values of $k$, de Kloet's is still 
more favorable. 
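The symmetry argument can be checked by brute force for small word sizes. The sketch below enumerates all $n$-bit words with exactly $k$ 1-bits and totals their trailing 0's; if the argument is right, the total times $k + 1$ equals the number of words times $n - k$. The function names and the bound $n \le 24$ are our own.

```c
#include <assert.h>
#include <stdint.h>

static unsigned pop(uint32_t x) {
    unsigned c = 0;
    while (x) { x &= x - 1; c++; }      /* clear the rightmost 1-bit */
    return c;
}

static unsigned ntz(uint32_t x) {       /* requires x != 0 */
    unsigned c = 0;
    while ((x & 1) == 0) { x >>= 1; c++; }
    return c;
}

/* Number of n-bit words having exactly k 1-bits. */
uint64_t count_words(unsigned n, unsigned k) {
    assert(n >= 1 && n <= 24 && k >= 1 && k <= n);
    uint64_t words = 0;
    for (uint32_t x = 1; x < (1u << n); x++)
        words += (pop(x) == k);
    return words;
}

/* Total number of trailing 0's over all n-bit words with exactly k 1-bits. */
uint64_t total_trailing_zeros(unsigned n, unsigned k) {
    assert(n >= 1 && n <= 24 && k >= 1 && k <= n);
    uint64_t total = 0;
    for (uint32_t x = 1; x < (1u << n); x++)
        if (pop(x) == k) total += ntz(x);
    return total;
}
```

With $n = 8$ and $k = 3$ there are 56 words and 70 trailing 0's in all, an average of $5/4 = (8 - 3)/(3 + 1)$.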


2. The and with 1 makes the shift amount independent of all 
bits of x except for its rightmost bit. Therefore, by 
looking at only the rightmost bit of the shift amount, one 
can ascertain whether the result is x or x << 1. Since both 
x and x << 1 are right-to-left computable, choosing one of 
these based on a rightmost bit is also. The function x << 
(x & 2), incidentally, is not right-to-left computable. But 
(x & -2) << (x & 2) is. 


Another example is the function $x^n$, where we take $x^0$ 
to be 1. This is not right-to-left computable because if $x$ is 
even, then the rightmost bit of the result depends upon 
whether or not $n = 0$, and thus is a function of bits to the 
left of the rightmost position. But if it were known a priori 
that the variable $n$ is either 0 or 1, then $x^n$ is right-to-left 
computable. Similarly, $x^{n \& 1}$ is right-to-left computable, 
for example, by 

$$x^{n \& 1} = (x \;\&\; -(n \& 1)) + 1 - (n \& 1) = \begin{cases} 1, & n \text{ even}, \\ x, & n \text{ odd.} \end{cases}$$

Notice that $x^n$ is like the left shift function in that $x^n$ is 
right-to-left computable for any particular value of $n$, or if 
$n$ is a variable restricted to the values 0 and 1, but not if $n$ 
is an unrestricted variable. 


3. A somewhat obvious formula for addition is given on page 
16, item (g): 

$$x + y = (x \oplus y) + 2(x \;\&\; y).$$

Dividing each side by 2 gives Dietz's formula. The 
addition in Dietz's formula cannot overflow because the 
average of two representable integers is representable. 


Notice that if we start with item (i) on page 16, we 
obtain the formula given in the text for the ceiling 
average of two unsigned integers: 

$$\left\lceil \frac{x + y}{2} \right\rceil = (x \mid y) - ((x \oplus y) \gg 1).$$
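Both averages are one-liners in C. The sketch below assumes 32-bit unsigned operands; `check_avg` is a hypothetical helper that compares the results with a 64-bit reference computation, which cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Dietz's formula: floor((x + y)/2) without overflow. */
uint32_t avg_floor(uint32_t x, uint32_t y) {
    return (x & y) + ((x ^ y) >> 1);
}

/* Ceiling average: ceil((x + y)/2) without overflow. */
uint32_t avg_ceil(uint32_t x, uint32_t y) {
    return (x | y) - ((x ^ y) >> 1);
}

/* Compares both formulas with a 64-bit reference computation. */
void check_avg(uint32_t x, uint32_t y) {
    uint64_t s = (uint64_t)x + y;
    assert(avg_floor(x, y) == s / 2);
    assert(avg_ceil(x, y) == (s + 1) / 2);
}
```

The naive `(x + y) / 2` gives the wrong answer whenever the 32-bit sum wraps around, for example when both operands are 0xFFFFFFFF.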


4. Compute the floor average of $a$ and $b$, and also of $c$ and $d$, 
using Dietz's formula. Then compute the floor average of 
$x$ and $y$, and apply a correction: 

$$x = (a \;\&\; b) + ((a \oplus b) \gg 1),$$

$$y = (c \;\&\; d) + ((c \oplus d) \gg 1),$$

$$r = (x \;\&\; y) + ((x \oplus y) \gg 1),$$

$$r = r + ((a \oplus b) \;\&\; (c \oplus d) \;\&\; (x \oplus y) \;\&\; 1).$$


The correction step is really four operations, not the 
seven that it appears to be, because the exclusive or terms 
were calculated earlier. It was arrived at by the following 
reasoning: The computed value of x can be lower than the 
true (real number) average by 1/2, and this error occurs if 
a is odd and b is even, or vice versa. This error amounts 
to 1/4 after x and y are averaged. If this were the only 
truncation error, the first value computed for r would be 
correct, because in this case the true average is an integer 
plus 1/4, and we want the floor average, so we want to 
discard the 1/4 anyway. Similarly, the truncation in 
computing y can make the computed average lower than 


the true average by 1/4. The first computed value of r can 
be lower than the true average of x and y by 1/2. These 
errors accumulate. If they sum to an error less than 1, 
they can be ignored, because we want to discard the 
fractional part of the true average anyway. But if all three 
errors occur, they sum to 1 / 4 + 1 / 4 + 1 / 2 = 1, 
which must be corrected by adding 1 to r. The last line 
does this: if one of a and b is odd, and one of c and d is 
odd, and one of x and y is odd, then we want to add 1, 
which the last line does. 
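The four steps translate directly into C. In the sketch below the three exclusive-or terms are saved so that the correction costs only the four extra operations mentioned above; the function names are ours, and `avg4_matches_reference` is a hypothetical brute-force check against 64-bit arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* floor((a + b + c + d)/4) for unsigned 32-bit operands, without overflow. */
uint32_t avg4_floor(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t ab = a ^ b, cd = c ^ d;
    uint32_t x = (a & b) + (ab >> 1);   /* floor average of a and b */
    uint32_t y = (c & d) + (cd >> 1);   /* floor average of c and d */
    uint32_t xy = x ^ y;
    uint32_t r = (x & y) + (xy >> 1);   /* floor average of x and y */
    return r + (ab & cd & xy & 1);      /* add 1 if all three truncated */
}

/* Returns 1 if the formula agrees with a 64-bit reference computation for
   all operand values in [0, limit). */
int avg4_matches_reference(uint32_t limit) {
    assert(limit <= 32);
    for (uint32_t a = 0; a < limit; a++)
        for (uint32_t b = 0; b < limit; b++)
            for (uint32_t c = 0; c < limit; c++)
                for (uint32_t d = 0; d < limit; d++)
                    if (avg4_floor(a, b, c, d) != ((uint64_t)a + b + c + d) / 4)
                        return 0;
    return 1;
}
```

The operands 1, 0, 1, 2 exercise the correction: $x = 0$, $y = 1$, the first value of $r$ is 0, and all three truncations occur, so the result is corrected to 1.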


5. The expression for $x \stackrel{u}{\le} y$ to be simplified is 

$$(\neg x \mid y) \;\&\; ((x \oplus y) \mid \neg(y - x)).$$

Only bit 31 of $x$ and $y$ is relevant in the logical 
operations of this expression. Because $y_{31} = 0$, the 
expression immediately simplifies to 

$$\neg x \;\&\; (x \mid \neg(y - x)).$$

"Multiplying in" the $\neg x$ (distributive law) gives 

$$\neg x \;\&\; \neg(y - x),$$

and applying De Morgan's law further simplifies it to 
three elementary instructions: 

$$\neg(x \mid (y - x)).$$

(Removing the complementation operator gives a two- 
instruction solution for the predicate $x \stackrel{u}{>} y$.) 

If $y$ is a constant, we can use the identity $\neg u = -1 - u$ 
to rewrite the expression obtained from the distributive 
law as 

$$\neg x \;\&\; (x - (y + 1)),$$

which is three instructions because the addition of 1 to $y$ 
can be done before evaluating the expression. This form 
is preferable when $y$ is a small constant, because the add 
immediate instruction can be used. (Problem suggested by 
George Timms.) 
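In C the three-instruction form looks as follows. The sketch assumes 32-bit words and, as in the exercise, that bit 31 of $y$ is 0; the predicate is delivered in the sign bit, which the code shifts down to produce 0 or 1. The function name is ours.

```c
#include <assert.h>
#include <stdint.h>

/* The unsigned predicate x <= y, valid only when bit 31 of y is 0.
   The answer is the sign bit of ~(x | (y - x)). */
int leu_31bit_y(uint32_t x, uint32_t y) {
    assert((y >> 31) == 0);
    return (int)(~(x | (y - x)) >> 31);
}
```

If $x$ has bit 31 set, the `or` sets bit 31 and the result is 0, which is correct because such an $x$ exceeds every permitted $y$. Otherwise both operands are less than $2^{31}$, and bit 31 of $y - x$ is set exactly when $x > y$.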


6. To get a carry from the second addition, the carry from 
the first addition must be 1, and the low-order 32 bits of 
the first sum must be all 1's. That is, the first sum must be 
at least $2^{33} - 1$. But the operands are each at most $2^{32} - 
1$, so their sum is at most $2^{33} - 2$. 


7. For notational simplicity, let us consider a 4-bit machine.
Let x and y denote the integer values of 4-bit quantities
under unsigned binary interpretation. Let f(x, y) denote
the integer result of applying ordinary binary addition
with end-around carry, to x and y, with a 4-bit adder and
a 4-bit result. Then,

$$f(x, y) = \operatorname{mod}(x + y + \lfloor (x + y)/16 \rfloor,\ 16).$$


Word   Value      Word   Value
0000     0        1000    -7
0001     1        1001    -6
0010     2        1010    -5
0011     3        1011    -4
0100     4        1100    -3
0101     5        1101    -2
0110     6        1110    -1
0111     7        1111    -0


The table above shows the one's-complement
interpretation of 4-bit binary words. Observe that the
one's-complement interpretation of a word whose straight
binary interpretation is x is given by

$$\operatorname{ones}(x) = \begin{cases} x, & 0 \le x \le 7, \\ x - 15, & 8 \le x \le 15. \end{cases}$$

We must show that f(x, y), when interpreted as a
one's-complement integer, is the sum of x and y when
they are interpreted as one's-complement integers. That
is, we must show that


ones(x) + ones(y) = ones(f (x, y)). 


We are interested only in the non-overflow cases (that is, 
when the sum can be expressed as a one's-complement 
integer). 


Case 0, $0 \le x, y \le 7$. Then, ones(x) + ones(y) = x + y,
and

$$f(x, y) = \operatorname{mod}(x + y + 0,\ 16) = x + y.$$

For no overflow, the one's-complement result must be in
the range 0 to 7, and from the table it is apparent that we
must have $x + y \le 7$. Therefore, ones(x + y) = x + y.

Case 1, $0 \le x \le 7$, $8 \le y \le 15$. Overflow cannot
occur because $\operatorname{ones}(x) \ge 0$ and
$\operatorname{ones}(y) \le 0$. In this case,
ones(x) + ones(y) = x + y − 15. If x + y < 16,

$$f(x, y) = \operatorname{mod}(x + y + 0,\ 16) = x + y.$$

In this case x + y must be at least 8, so ones(x + y) =
x + y − 15. On the other hand, if $x + y \ge 16$,

$$f(x, y) = \operatorname{mod}(x + y + 1,\ 16) = x + y + 1 - 16 = x + y - 15.$$

Because x + y is at most 22 and is at least 16,
$1 \le x + y - 15 \le 7$, so that ones(x + y − 15) = x + y − 15.

Case 2, $8 \le x \le 15$, $0 \le y \le 7$. This is similar to case
1 above.

Case 3, $8 \le x \le 15$, $8 \le y \le 15$. Then, ones(x) +
ones(y) = x − 15 + y − 15 = x + y − 30, and

$$f(x, y) = \operatorname{mod}(x + y + 1,\ 16) = x + y + 1 - 16 = x + y - 15.$$

Because of the limits on x and y, $16 \le x + y \le 30$. To
avoid overflow, the table reveals that we must have
$x + y \ge 23$. For, in terms of one's-complement interpretation,
we can add −6 and −1, or −6 and −0, but not −6 and −2,
without getting overflow. Therefore, $23 \le x + y \le 30$.
Hence $8 \le x + y - 15 \le 15$, so that ones(x + y − 15) =
x + y − 30.

For the carry propagation question, for
one's-complement addition, the worst case occurs for something
like

     000...0011
   +          1   (end-around carry)
     000...0100

for which the carry is propagated n places, where n is the
word size. In two's-complement addition, the worst case
is n − 1 places, assuming the carry out of the high-order
position is discarded.


The following comparisons are interesting, using 4-bit 
quantities for illustration: In straight binary (unsigned) or 
two's-complement arithmetic, the sum of two numbers is 
always (even if overflow occurs) correct modulo 16. In 
one's-complement, the sum is always correct modulo 15. 
If $x_n$ denotes bit n of x, then in two's-complement
notation, $x = -8x_3 + 4x_2 + 2x_1 + x_0$. In
one's-complement notation, $x = -7x_3 + 4x_2 + 2x_1 + x_0$.
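The case analysis for end-around-carry addition can also be checked exhaustively for the 4-bit machine. The following is a small sketch; the function names are chosen here for illustration and do not come from the text.

```c
/* 4-bit addition with end-around carry:
   f(x, y) = mod(x + y + floor((x + y)/16), 16), for 0 <= x, y <= 15. */
unsigned add_end_around(unsigned x, unsigned y) {
    unsigned s = x + y;          /* at most 30 */
    return (s + (s >> 4)) & 15;  /* s >> 4 is the carry out, 0 or 1 */
}

/* One's-complement value of the 4-bit word x (both 0000 and 1111 give 0). */
int ones(unsigned x) {
    return x <= 7 ? (int)x : (int)x - 15;
}
```

Whenever ones(x) + ones(y) lies in the range −7 to 7, `ones(add_end_around(x, y))` equals that sum, which is the statement proved above.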


8. $((x \oplus y) \,\&\, m) \oplus y$.
9. $x \oplus y = (x \mid y) \,\&\, \lnot(x \,\&\, y)$.
10. [Arndt] Variable t is 1 if the bits differ (six instructions).

   t ← ((x >> i) ⊕ (x >> j)) & 1
   x ← x ⊕ (t << j)

Adding the line x ← x ⊕ (t << i) makes it swap bits i and j.
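In C the two variants might look as follows. This is a sketch with names chosen here; it assumes 0 ≤ i, j ≤ 31.

```c
#include <stdint.h>

/* Copy bit i of x to bit j. */
uint32_t copy_bit(uint32_t x, int i, int j) {
    uint32_t t = ((x >> i) ^ (x >> j)) & 1;  /* 1 if bits i and j differ */
    return x ^ (t << j);                     /* flip bit j only if they differ */
}

/* Swap bits i and j of x by flipping both when they differ. */
uint32_t swap_bits(uint32_t x, int i, int j) {
    uint32_t t = ((x >> i) ^ (x >> j)) & 1;
    return x ^ (t << j) ^ (t << i);
}
```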


11. As described in the text, any Boolean function
$f(x_1, x_2, \ldots, x_n)$ can be decomposed into the form
$g(x_1, x_2, \ldots, x_{n-1}) \oplus x_n h(x_1, x_2, \ldots, x_{n-1})$.
Let c(n) be the number of instructions
required for the decomposition of an n-variable Boolean
function into binary Boolean instructions, for $n \ge 2$. Then

$$c_{n+1} = 2c_n + 2,$$

with $c_2 = 1$. This has the solution

$$c_n = 3 \cdot 2^{n-2} - 2.$$

(The least upper bound is much smaller.)
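12. (a)

$$\begin{aligned}
f(x, y, z) &= \bar{z} f_0(x, y) + z f_1(x, y) \\
           &= \bar{z} f_0(x, y) \oplus z f_1(x, y) \\
           &= (1 \oplus z) f_0(x, y) \oplus z f_1(x, y) \\
           &= f_0(x, y) \oplus z f_0(x, y) \oplus z f_1(x, y) \\
           &= f_0(x, y) \oplus z (f_0(x, y) \oplus f_1(x, y)),
\end{aligned}$$

which is in the required form.
(b) From part (a),

$$\begin{aligned}
f(x, y, z) &= f_0(x, y) \oplus z (f_0(x, y) \oplus f_1(x, y)) \\
           &= \bar{f}_1(x, y) \oplus (z + (\bar{f}_0(x, y) \oplus f_1(x, y))),
\end{aligned}$$

which is in the required form.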

13. Using the notation of Table 2-3 on page 54, the missing
functions can be obtained from 0000 = andc(x, x), 0011
= and(x, x), 0100 = andc(y, x), 0101 = and(y, y), 1010
= nand(y, y), 1011 = cor(y, x), 1100 = nand(x, x), and
1111 = cor(x, x).

14. No. The ten truly binary functions are, in numeric form, 


0001 0010 0100 0110 0111 

1000 1001 1011 1101 1110 
By implementing function 0010 you get 0100 by
interchanging the operands, and, similarly, 1011 yields
1101. That's all you can accomplish by interchanging the
operands, because the other functions are commutative.
Equating the operands, of course, reduces a function to a
constant or unary function. Therefore, you need eight
instruction types.

15. The table below shows one set of six instruction types that
accomplish the task. Here, x denotes the contents of the
register operand, and k denotes the contents of the
immediate field.

SIX SUFFICIENT R-I BOOLEAN INSTRUCTIONS

Function Values   Formula              Instruction Mnemonic
0001              $xk$                 and
0111              $x + k$              or
0110              $x \oplus k$         xor
1110              $\overline{xk}$      nand
1000              $\overline{x + k}$   nor
0101              $k$                  const

The missing functions can be obtained from 0000 =
and(x, 0), 0010 = and(x, $\bar{k}$), 0011 = or(x, 0), 0100 =
nor(x, $\bar{k}$), 1001 = xor(x, $\bar{k}$), 1010 = const(x, $\bar{k}$), 1011 =
or(x, $\bar{k}$), 1100 = nor(x, 0), 1101 = nand(x, $\bar{k}$), and
1111 = nand(x, 0).

16. This writer does not know of an "analytic" way to do this.
But it is not difficult to write a program that generates all 
Boolean functions of three variables that can be 
implemented with three binary instructions. Such a 
program is given in C below. It is written in as simple a 
way as possible to give a convincing answer to the 
question. Some optimizations are possible, which are 
mentioned below. 


The program represents a function by an 8-bit string 
that is the truth table of the function, with the values for 
x, y, and z written in the usual way for a truth table. Each 


time a function is generated, it is checked off by setting a 
byte in vector found to 1. This vector is 256 bytes long 
and is initially all zero. 


The truth table that the program works with is shown in 
the table below. 


TRUTH TABLE FOR THREE VARIABLES 


The six columns of the truth table are stored in a
vector fun. The first three positions of fun contain the
truth table columns for x, y, and z. These columns have
the values hexadecimal 0F, 33, and 55, which represent
the trivial functions f(x, y, z) = x, f(x, y, z) = y, and
f(x, y, z) = z. The next three positions will contain the
truth table columns for the functions generated by one,
two, and three binary instructions, respectively, for the
current trial.


The program conceptually consists of three nested
loops, one for each instruction currently being tried. The
outermost loop iterates over all 16 binary Boolean
operations, operating on all pairs of x, y, and z (16*3*3 =
144 iterations). For each iteration, the result of operating
on all eight bits of x, y, and/or z in parallel is put in
fun[3].

The next level of looping similarly iterates over all 16
binary Boolean operations, operating on all pairs of x, y,
z, and the result of the outermost loop (16*4*4 = 256
iterations). For each iteration, the result is put in fun[4].

The innermost level of looping similarly iterates over
all 16 binary Boolean operations, operating on all pairs of
x, y, z, and the results of the outer two loops (16*5*5 =
400 iterations). For each of these calculated functions, the
corresponding byte of found is set to 1.


At the end, the program writes out vector found in 16
rows of 16 vector elements each. Several positions of
vector found are 0, showing that three binary Boolean
instructions do not suffice to implement all 256 Boolean
functions of three variables. The first function that was
not calculated is number 0x16, or binary 00010110,
which represents the function $\bar{x}yz + x\bar{y}z + xy\bar{z}$.


There are many symmetries that could be used to 
reduce the number of iterations. For example, for a given 
operation op and operands x and y, it is not necessary to 
evaluate both op(x, y) and op(y, x), because if op(x, y) is 
evaluated, then op(y, x) will result from op'(x, y) where 
op’ is another of the 16 binary operations. Similarly, it is 
not necessary to evaluate op(x, x), because that will be 
equal to op'(x, y) for some other function op’. Thus, the 
outermost loops that select combinations of operands to 
try could be written 


   for (i1 = 0; i1 < 3; i1++) {
      for (i2 = i1 + 1; i2 < 3; i2++) {


and similarly for the other loops. 


Another improvement results from observing that it is 
not necessary to include all 16 binary Boolean operations 
in the table. The operations numbered 0, 3, 5, 10, 12, and 
15 can be omitted, reducing the loops that iterate over the 
operations from 16 to ten iterations. The argument in 
support of this is a little lengthy and is not given here. 


The program can be easily changed to experiment 
with smaller instruction sets, or allow more instructions, 
or handle more variables. But be forewarned: The 
execution time increases dramatically with the number of 
instructions being allowed, because that determines the 
level of nesting in the main program. As a practical 
matter, you can't go beyond five instructions. 


/* Determines which of the 256 Boolean functions of three
variables can be implemented with three binary Boolean
instructions if the instruction set includes all 16 binary
Boolean operations. */

#include <stdio.h>

char found[256];

unsigned char boole(int op, unsigned char x,
   unsigned char y) {

   switch (op) {
      case  0: return 0;
      case  1: return x & y;
      case  2: return x & ~y;
      case  3: return x;
      case  4: return ~x & y;
      case  5: return y;
      case  6: return x ^ y;
      case  7: return x | y;
      case  8: return ~(x | y);
      case  9: return ~(x ^ y);
      case 10: return ~y;
      case 11: return x | ~y;
      case 12: return ~x;
      case 13: return ~x | y;
      case 14: return ~(x & y);
      case 15: return 0xFF;
   }
   return 0;              // Not reached for 0 <= op <= 15.
}

#define NB 16             // Number of Boolean operations.

int main() {
   int i, j, o1, i1, i2, o2, j1, j2, o3, k1, k2;
   unsigned char fun[6];  // Truth table, 3 columns for
                          // x, y, and z, and 3 columns
                          // for computed functions.

   fun[0] = 0x0F;         // Truth table column for x,
   fun[1] = 0x33;         // y,
   fun[2] = 0x55;         // and z.

   for (o1 = 0; o1 < NB; o1++) {
   for (i1 = 0; i1 < 3; i1++) {
   for (i2 = 0; i2 < 3; i2++) {
      fun[3] = boole(o1, fun[i1], fun[i2]);
      for (o2 = 0; o2 < NB; o2++) {
      for (j1 = 0; j1 < 4; j1++) {
      for (j2 = 0; j2 < 4; j2++) {
         fun[4] = boole(o2, fun[j1], fun[j2]);
         for (o3 = 0; o3 < NB; o3++) {
         for (k1 = 0; k1 < 5; k1++) {
         for (k2 = 0; k2 < 5; k2++) {
            fun[5] = boole(o3, fun[k1], fun[k2]);
            found[fun[5]] = 1;
         }}}
      }}}
   }}}

   printf("  0 1 2 3 4 5 6 7 8 9 A B C D E F\n");
   for (i = 0; i < 16; i++) {
      printf("%X", i);
      for (j = 0; j < 16; j++)
         printf("%2d", found[16*i + j]);
      printf("\n");
   }
   return 0;
}


All ternary Boolean functions computable with three 
instructions, continued. 


Chapter 3: Power-of-2 Boundaries 


1. (a) (x + 4) & −8.
(b) (x + 3) & −8.
(c) (x + 3 + ((x >> 3) & 1)) & −8.

Part (c) can be done in four instructions if the extract
instruction is available; it can do (x >> 3) & 1 in one
instruction.

Note: Unbiased rounding preserves the average value 
of a large set of random integers. 
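The three expressions can be written out in C as follows. This is an illustrative sketch with names chosen here; it assumes unsigned 32-bit operands and writes the mask −8 as ~7.

```c
#include <stdint.h>

/* Round x to the nearest multiple of 8; the halfway case rounds up. */
uint32_t round8_half_up(uint32_t x) {
    return (x + 4) & ~(uint32_t)7;
}

/* Round x to the nearest multiple of 8; the halfway case rounds down. */
uint32_t round8_half_down(uint32_t x) {
    return (x + 3) & ~(uint32_t)7;
}

/* Round x to the nearest multiple of 8; the halfway case goes to the
   even multiple. Bit 3 of x tells whether the multiple below x is odd. */
uint32_t round8_half_even(uint32_t x) {
    return (x + 3 + ((x >> 3) & 1)) & ~(uint32_t)7;
}
```

For instance, the halfway value 12 rounds to 16, 8, and 16 under the three rules, and the halfway value 20 rounds to 24, 16, and 16.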


2. The standard way to do part (a) is
$((x + 5) \overset{u}{\div} 10) * 10$. If
the remainder function is readily available, it can also be
done with x + 5 − remu(x + 5, 10), which saves a
multiplication at the expense of an addition.

Part (b) is similar, but replace the 5 with 4 in the
answer for part (a).

Part (c): Use the fact that a multiple of 10 is an odd
multiple of 10 if and only if it is an odd multiple of 2.

   r = x % 10;
   y = x - r;
   if (r > 5 | (r == 5 & (y & 2) != 0))
      y = y + 10;

An alternative (must have $x \le 2^{32} - 6$):

   r = (x + 5)%10;
   y = x + 5 - r;
   if (r == 0 & (y & 2) != 0)
      y = y - 10;


3. A possible implementation in C is shown below.

   int loadUnaligned(int *a) {
      int *alo, *ahi;
      int xlo, xhi, shift;

      // The casts assume that pointers and int are both 32 bits.
      alo = (int *)((int)a & -4);
      ahi = (int *)(((int)a + 3) & -4);
      xlo = *alo;
      xhi = *ahi;
      shift = ((int)a & 3) << 3;
      return ((unsigned)xlo >> shift) | (xhi << (32 - shift));
   }
Chapter 4: Arithmetic Bounds 


1. For a = c = 0, inequalities (5) become

$$\begin{aligned}
0 \le x - y \le 2^{32} - 1 &\quad \text{if } -d < 0 \text{ and } b \ge 0, \\
-d \le x - y \le b &\quad \text{otherwise.}
\end{aligned}$$

Because the quantities are unsigned, −d < 0 is
equivalent to $d \ne 0$, and $b \ge 0$ is true. Therefore, the
inequalities simplify to

$$\begin{aligned}
0 \le x - y \le 2^{32} - 1 &\quad \text{if } d \ne 0, \\
0 \le x - y \le b &\quad \text{if } d = 0.
\end{aligned}$$

This is simply the observation that if d = 0, then y = 0
and so, trivially, $0 \le x - y \le b$. On the other hand, if
$d \ne 0$, then the difference can attain the value 0 by choosing
x = y = 0, and it can attain the maximum unsigned
number by choosing x = 0 and y = 1.


2. If a = 0, the test if (temp >= a) is always true.
Therefore, when the first position (from the left) is found
in which the bits of b and d are 1, the program sets that
bit of b equal to 0 and the following bits equal to 1, and
returns that value or'ed with d. This can be accomplished
more simply with the following replacement for the body
of the procedure. The if statement is required only on
machines that have mod 32 shifts, such as the Intel x86
family.

   temp = nlz(b & d);
   if (temp == 0) return 0xFFFFFFFF;
   m = 1 << (32 - temp);
   return b | d | (m - 1);

For example, suppose

$$0 \le x \le \text{0b01001000} \quad\text{and}\quad \text{0b00000011} \le y \le \text{0b00101010}.$$

Then to find the maximum value of x | y, the procedure is
to scan from the left for the first position in which b and
d are both 1. The maximum value is b | d for bits to the
left of that position, and 1's for bits at and to the right of
that position. For the example, this is 0b01001000 |
0b00101010 | 0b00001111 = 0b01101111.
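A self-contained version of this replacement body is sketched below so that it can be compared against a brute-force search. The function name `max_or_a0` and the loop-based `nlz` are assumptions made for this illustration; any correct nlz will do.

```c
#include <stdint.h>

/* Number of leading zeros of x (32 for x == 0); a plain loop. */
int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Maximum of x | y over 0 <= x <= b and c <= y <= d (the a = 0 case).
   The lower bound c does not affect the result, so it is not a parameter. */
uint32_t max_or_a0(uint32_t b, uint32_t d) {
    int temp = nlz(b & d);
    if (temp == 0) return 0xFFFFFFFFu;       /* avoid a shift by 32 */
    uint32_t m = (uint32_t)1 << (32 - temp);
    return b | d | (m - 1);
}
```

With the bounds of the example, `max_or_a0(0x48, 0x2A)` gives 0x6F.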


Chapter 5: Counting Bits 
1. A version from Norbert Juffa:

   int ntz(unsigned int n) {
      static unsigned char tab[32] =
      {   0,  1,  2, 24,  3, 19,  6, 25,
         22,  4, 20, 10, 16,  7, 12, 26,
         31, 23, 18,  5, 21,  9, 15, 11,
         30, 17,  8, 14, 29, 13, 28, 27
      };
      unsigned int k;

      n = n & (-n);          /* isolate lsb */
   #if defined(SLOW_MUL)
      k = (n << 11) - n;
      k = (k <<  2) + k;
      k = (k <<  8) + n;
      k = (k <<  5) - k;
   #else
      k = n * 0x4d7651f;
   #endif
      return n ? tab[k>>27] : 32;
   }


2. x + (x & −x). This is used in the snoob function (page 15).

3. Denote the parallel prefix operation applied to x by
PP-XOR(x). Then, if y = PP-XOR(x), $x = y \oplus (y \overset{u}{\gg} 1)$. To
see this, let x be the 4-bit quantity abcd (where each
letter denotes a single bit). Then

$$\begin{aligned}
y = \text{PP-XOR}(x) &= (a)(a \oplus b)(a \oplus b \oplus c)(a \oplus b \oplus c \oplus d), \\
y \overset{u}{\gg} 1 &= (0)(a)(a \oplus b)(a \oplus b \oplus c),
\end{aligned}$$

so that

$$y \oplus (y \overset{u}{\gg} 1) = (a)(b)(c)(d).$$

For the parallel suffix operation, if y = PS-XOR(x) then,
as you might guess, $x = y \oplus (y \ll 1)$.
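A round trip in C makes the inversion concrete. This sketch follows the 4-bit example above in taking each bit of PP-XOR(x) to be the exclusive or of that bit and all bits to its left; the function names and the logarithmic shift-and-xor steps are choices made here.

```c
#include <stdint.h>

/* Each result bit is the XOR of the same bit of x and all bits to its left. */
uint32_t pp_xor(uint32_t x) {
    x ^= x >> 1;
    x ^= x >> 2;
    x ^= x >> 4;
    x ^= x >> 8;
    x ^= x >> 16;
    return x;
}

/* Inverse of pp_xor: x = y ^ (y >> 1), with an unsigned (logical) shift. */
uint32_t pp_xor_inverse(uint32_t y) {
    return y ^ (y >> 1);
}

/* Suffix version: each bit is the XOR of itself and all bits to its right. */
uint32_t ps_xor(uint32_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    return x;
}

/* Inverse of ps_xor: x = y ^ (y << 1). */
uint32_t ps_xor_inverse(uint32_t y) {
    return y ^ (y << 1);
}
```

For the 4-bit value 1011, `pp_xor` gives 1101, and `pp_xor_inverse(0xD)` returns 0xB again.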


Chapter 6: Searching Words 


1. Length and position of the longest string of 1's (cf.
Norbert Juffa):

   int fmaxstr1(unsigned x, int *apos) {
      int k;
      unsigned oldx;

      *apos = nlz(oldx);
      return k;
   }
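The loop of this function did not survive in the listing above. One loop that fits the surviving declarations is sketched below; it is an assumption, not necessarily the original code. It repeatedly shortens every string of 1's by one bit and counts the steps until x becomes 0. The name `fmaxstr1_sketch` and the loop-based `nlz` are made up for this illustration.

```c
#include <stdint.h>

/* Number of leading zeros of x (32 for x == 0); a plain loop. */
int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Returns the length of the longest string of 1's in x and stores in
   *apos the number of bits to the left of the leftmost such string.
   For x == 0 the length is 0 and *apos is 32. */
int fmaxstr1_sketch(uint32_t x, int *apos) {
    int k;
    uint32_t oldx = 0;

    for (k = 0; x != 0; k++) {
        oldx = x;          /* last nonzero value: left ends of the longest strings */
        x &= x << 1;       /* turn off the rightmost bit of every string */
    }
    *apos = nlz(oldx);
    return k;
}
```

For x = 0x00FF0F00 the sketch returns 8 with *apos = 8.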


2. As said in the text, this can be done by first
left-propagating the 0's in x by n − 1 positions, and then
finding the shortest string of 1's in the revised x. A good
way to do the left-propagation is to use the code of Figure
6-5 on page 125, which is logarithmic in its execution
time. (But the second part of the algorithm is linear in the
length of the shortest string of 1's in the revised x.) The
code is shown below. It assumes that $1 \le n \le 32$. In the
"not found" case, the function returns with apos = 32. In
this case, the length should be regarded as undefined, but
it happens to return a length of n − 1.

   int bestfit(unsigned x, int n, int *apos) {
      int m, s;

      m = n;
      while (m > 1) {
         s = m >> 1;
         x = x & (x << s);
         m = m - s;
      }
      return fminstr1(x, apos) + n - 1;
   }


3. The code below uses an expression from page 12 for
turning off the rightmost contiguous string of 1's.

   int fminstr1(unsigned x, int *apos) {
      int k, kmin, y0, y;
      unsigned int x0, xmin;

      kmin = 32;
      y0 = pop(x);
      x0 = x;
      do {
         x = ((x & -x) + x) & x;   // Turn off rightmost
         y = pop(x);               // string.
         k = y0 - y;               // k = length of string
         if (k <= kmin) {          // turned off.
            kmin = k;              // Save shortest length
            xmin = x;              // found, and the string.
         }
         y0 = y;
      } while (x != 0);
      *apos = nlz(x0 ^ xmin);
      return kmin;
   }


The function executes in 5 + 11n instructions, where
n is the number of strings of 1's in x, for $n \ge 1$ (that is,
for $x \ne 0$). This assumes the if-test goes either way half
the time, and that pop(x) and nlz(x) count as one
instruction each. By making changes to the sense of the
"if (k <= kmin)" test, and to the initialization of kmin, it
can be made to find the longest string of 1's, and either
the leftmost or the rightmost in the case of equally long
strings. It is also easily modified to perform the "best fit"
function.
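To experiment with the shortest-string and best-fit functions, they can be combined into one compilable unit. In the sketch below, `pop` and `nlz` are simple loop implementations supplied as assumptions (a real program would use the faster versions from Chapter 5), and xmin is given an initial value only to keep compilers quiet.

```c
#include <stdint.h>

/* Population count; clears one 1-bit per iteration. */
int pop(uint32_t x) {
    int n = 0;
    while (x != 0) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Number of leading zeros of x (32 for x == 0). */
int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Length of the shortest string of 1's in x; *apos receives the number
   of bits to the left of it (32 if x == 0). Ties go to the leftmost. */
int fminstr1(uint32_t x, int *apos) {
    int k, kmin = 32, y0 = pop(x), y;
    uint32_t x0 = x, xmin = 0;

    do {
        x = ((x & -x) + x) & x;    /* turn off the rightmost string */
        y = pop(x);
        k = y0 - y;                /* length of the string turned off */
        if (k <= kmin) {
            kmin = k;
            xmin = x;
        }
        y0 = y;
    } while (x != 0);
    *apos = nlz(x0 ^ xmin);
    return kmin;
}

/* Shortest string of 1's of length at least n, for 1 <= n <= 32. */
int bestfit(uint32_t x, int n, int *apos) {
    int m = n, s;

    while (m > 1) {                /* left-propagate the 0's by n - 1 */
        s = m >> 1;
        x = x & (x << s);
        m = m - s;
    }
    return fminstr1(x, apos) + n - 1;
}
```

For x = 0x00FF0F00, `bestfit(x, 3, &pos)` returns 4 with pos = 20, and `bestfit(x, 5, &pos)` returns 8 with pos = 8.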


4. The first bit of x will be 1, and hence mark the beginning
of a string of 1's, with probability 0.5. Any other bit
marks the beginning of a string of 1's with probability
0.25 (it must be 1, and the bit to its left must be 0).
Therefore the average number of strings of 1's is 0.5 +
31·0.25 = 8.25.
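The same argument gives 0.5 + (w − 1)·0.25 for words of w bits, which is small enough to confirm by exhaustive enumeration when w is 16 or less. The sketch below does that; the function names are made up here, and the word is assumed to occupy the low-order w bits of a 32-bit variable whose other bits are 0.

```c
#include <stdint.h>

/* Number of strings of 1's in x. A string begins at every bit that is 1
   and whose left neighbor is 0 (or absent). */
int count_strings(uint32_t x) {
    uint32_t starts = x & ~(x >> 1);
    int n = 0;
    while (starts != 0) {
        starts &= starts - 1;
        n++;
    }
    return n;
}

/* Average number of strings of 1's over all words of w bits, 1 <= w <= 16. */
double average_strings(int w) {
    uint64_t total = 0;
    uint32_t limit = (uint32_t)1 << w;
    for (uint32_t x = 0; x < limit; x++) {
        total += (uint64_t)count_strings(x);
    }
    return (double)total / (double)limit;
}
```

For w = 16 the average comes out to 4.25, in agreement with 0.5 + 15·0.25.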


5. One would expect the vast majority of words, if they are 
fairly long, to contain a string of 1’s of length 1. For, if it 
begins with 10, or ends with 01, or contains the string 
010, then its shortest contained string of 1’s is of length 
1. Therefore the average length is probably just slightly 
more than 1. 


An exhaustive check of all $2^{32}$ words shows that the
average length of the shortest string of 1's is
approximately 1.011795.


6. (Solution by John Gunnels) This problem is surprisingly
difficult, but the technique used is a good one to know.
The solution is based on a recursion that counts the
number of words in each of four sets, as shown in the
table below. In this table, "singleton" means a string of
1's of length 1, "nnn" denotes a string of length ≥ 0 that
does not contain a singleton, and "sss" means a string of
length ≥ 1 that contains a singleton. The ellipsis means 0
or more of the preceding bit. Every binary word is in one
and only one of these four sets.

Set   Words of the Form       Description
A     nnn0... or null         Does not have a singleton, but might at the next step
B     nnn01 or 1              Has a singleton, but might not at the next step
C     nnn011... or 11...      Does not have a singleton, and will not at the next step
D     sss0... or sss01...     Has a singleton, and will at the next step


At each step, a bit is appended to the right-hand end
of the word. As this is done, a word moves from one set to
another as shown below. It moves to the left alternative if
a 0 is appended, and to the right alternative if a 1 is
appended.

   A → A or B
   B → D or C
   C → A or C
   D → D or D

For example, the word 1101 is in set B. If a 0 is
appended, it becomes 11010, which is in set D. If a 1 is
appended, it becomes 11011, which is in set C.

Let $a_n$, $b_n$, $c_n$, and $d_n$ denote the sizes of sets A, B, C,
and D, respectively, after n steps (when the words are of
length n). Then

$$\begin{aligned}
a_{n+1} &= a_n + c_n, \\
b_{n+1} &= a_n, \\
c_{n+1} &= b_n + c_n, \text{ and} \\
d_{n+1} &= b_n + 2d_n.
\end{aligned}$$

This is because set A at step n + 1 contains every
member of set A at step n, with a 0 appended, and also
every member of set C at step n, with a 0 appended. Set B
at step n + 1 contains only every member of set A at step
n, with a 1 appended, and so on.

The initial conditions are $a_0 = 1$ and $b_0 = c_0 = d_0 = 0$.


It is a simple matter to evaluate these difference
equations with a computer program or even by hand. The
result, for n = 32, is

$$\begin{aligned}
a_{32} &= 26{,}931{,}732, \\
b_{32} &= 15{,}346{,}786, \\
c_{32} &= 20{,}330{,}163, \\
d_{32} &= 4{,}232{,}358{,}615, \text{ and} \\
b_{32} + d_{32} &= 4{,}247{,}705{,}401.
\end{aligned}$$


The last line gives the number we are interested in: the
number of words for which their shortest contained string
of 1's is of length 1. It is about 98.9 percent of the
number of 32-bit words ($2^{32}$).
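The computer evaluation mentioned above takes only a few lines. The following sketch iterates the four difference equations from the initial conditions; the function name is made up here, and 64-bit integers are used because $d_{32}$ does not fit in a signed 32-bit integer.

```c
#include <stdint.h>

/* Number of n-bit words whose shortest contained string of 1's has
   length 1, that is, b_n + d_n from the difference equations. */
uint64_t count_shortest_is_1(int n) {
    uint64_t a = 1, b = 0, c = 0, d = 0;   /* a_0 = 1, b_0 = c_0 = d_0 = 0 */

    for (int i = 0; i < n; i++) {
        uint64_t a1 = a + c;       /* a_{n+1} = a_n + c_n   */
        uint64_t b1 = a;           /* b_{n+1} = a_n         */
        uint64_t c1 = b + c;       /* c_{n+1} = b_n + c_n   */
        uint64_t d1 = b + 2 * d;   /* d_{n+1} = b_n + 2 d_n */
        a = a1; b = b1; c = c1; d = d1;
    }
    return b + d;
}
```

Calling `count_shortest_is_1(32)` reproduces the value 4,247,705,401 given above.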

What about a closed-form solution? This is also
difficult to obtain. We will just sketch a solution.

Let $e_n = b_n + d_n$, which is the quantity we desire to
find. Then, from the difference equations, and using the
fact that $a_n + b_n + c_n + d_n = 2^n$,

$$\begin{aligned}
e_n &= b_n + d_n \\
    &= 2^n - a_n - c_n \\
    &= 2^n - a_{n+1}.
\end{aligned}$$

Thus, if we can find a closed-form formula for $a_n$, we will
have one for $e_n$.

We can find a single-variable difference equation for
$a_n$ as follows. From the difference equations,

$$\begin{aligned}
a_n &= a_{n-1} + c_{n-1} \\
    &= a_{n-1} + b_{n-2} + c_{n-2} \\
    &= a_{n-1} + a_{n-3} + c_{n-2} \\
    &= a_{n-1} + a_{n-3} + a_{n-1} - a_{n-2} \\
    &= 2a_{n-1} - a_{n-2} + a_{n-3}.
\end{aligned}$$

This difference equation can be solved by well-known
methods. The process is a bit lengthy and messy and
won't be gone into here. It involves the solution of a cubic
polynomial that has two complex roots. When combined
with the equation for $e_n$, we obtain, approximately,

$$\begin{aligned}
e_n = 2^n &- 0.41150 \cdot 1.7549^{\,n+1} \\
          &- (0.29425 - 0.13811i)(0.12256 + 0.74486i)^{n+1} \\
          &- (0.29425 + 0.13811i)(0.12256 - 0.74486i)^{n+1}.
\end{aligned}$$

If n is an integer, the imaginary parts cancel out, which is
not hard to prove. (Hint: If x and y are complex
conjugates, then so are $x^n$ and $y^n$.)

We can get a formula involving only real numbers.
The real part of the second term of the formula above is
certainly less than

$$|0.29425 - 0.13811i| \cdot |0.12256 + 0.74486i|^{\,n+1},$$

which is, for n = 0,

$$0.32505 \cdot 0.75488 = 0.24537,$$

and is still smaller for n > 0. The same holds for the last
term of the equation for $e_n$. Therefore the real part of the
last two terms sum to less than 0.5. Since $e_n$ is known a
priori to be an integer, this means that $e_n$ is given by the
first two terms rounded to the nearest integer, or

$$e_n \approx \lfloor 2^n - 0.41150 \cdot 1.7549^{\,n+1} + 0.5 \rfloor.$$


7. Briefly, this problem can be solved by using 10 sets of
words, described below. In this table, "nnn" denotes a
string of length ≥ 0 whose shortest contained string of
1's is of length 0 or is ≥ 3, "ddd" denotes a string of
length ≥ 2 whose shortest contained string of 1's is of
length 2, and "sss" denotes a string of length ≥ 1 whose
shortest contained string of 1's is of length 1. (The sets
keep track of the words that contain a singleton at a
position other than the rightmost, because such words
will never have a shortest contained string of 1's of length
2.) The ellipsis means 0 or more of the preceding bit.

Set   Words of the Form          Set   Words of the Form
A     nnn0... or null            F     ddd01
B     nnn01 or 1                 G     ddd011
C     nnn011 or 11               H     ddd0111...
D     nnn0111... or 111...       I     sss0...
E     ddd0...                    J     sss01...

At each step, as a bit is appended to the right-hand
end of a word from one of these sets, it moves to another
set as shown below. It moves to the left alternative if a 0
is appended, and to the right alternative if a 1 is
appended.

   A → A or B      F → I or G
   B → I or C      G → E or H
   C → E or D      H → E or H
   D → A or D      I → I or J
   E → E or F      J → I or J

Let $a_n, b_n, \ldots, j_n$ denote the sizes of sets A, B, ..., J,
respectively, after n steps (when the words are of length n). Then

$$\begin{aligned}
a_{n+1} &= a_n + d_n, & f_{n+1} &= e_n, \\
b_{n+1} &= a_n, & g_{n+1} &= f_n, \\
c_{n+1} &= b_n, & h_{n+1} &= g_n + h_n, \\
d_{n+1} &= c_n + d_n, & i_{n+1} &= b_n + f_n + i_n + j_n, \\
e_{n+1} &= c_n + e_n + g_n + h_n, & j_{n+1} &= i_n + j_n.
\end{aligned}$$

The initial conditions are $a_0 = 1$ and all other variables
are 0.


The quantity we are interested in, the number of
words whose shortest contained string of 1's is of length
2, is given by $c_n + e_n + g_n + h_n$. For n = 32, the
difference equations give for this the value 44,410,452,
which is about 1.034 percent of the number of 32-bit
words. As an additional result, the number of words
whose shortest contained string of 1's is of length 1 is
given by $b_n + f_n + i_n + j_n$, which for n = 32 evaluates to
4,247,705,401, confirming the result of the preceding
exercise.
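The ten difference equations can be evaluated in the same way as the four of the preceding exercise. The sketch below is an illustration only; the function name and its two output parameters are made up here.

```c
#include <stdint.h>

/* Iterates the ten difference equations for n steps. On return,
   *len2 = c_n + e_n + g_n + h_n (shortest string of 1's has length 2) and
   *len1 = b_n + f_n + i_n + j_n (shortest string of 1's has length 1). */
void count_by_shortest(int n, uint64_t *len2, uint64_t *len1) {
    /* v[0..9] hold a, b, c, d, e, f, g, h, i, j. Initially a_0 = 1. */
    uint64_t v[10] = {1, 0, 0, 0, 0, 0, 0, 0, 0, 0};

    for (int step = 0; step < n; step++) {
        uint64_t a = v[0], b = v[1], c = v[2], d = v[3], e = v[4];
        uint64_t f = v[5], g = v[6], h = v[7], i = v[8], j = v[9];
        v[0] = a + d;
        v[1] = a;
        v[2] = b;
        v[3] = c + d;
        v[4] = c + e + g + h;
        v[5] = e;
        v[6] = f;
        v[7] = g + h;
        v[8] = b + f + i + j;
        v[9] = i + j;
    }
    *len2 = v[2] + v[4] + v[6] + v[7];
    *len1 = v[1] + v[5] + v[8] + v[9];
}
```

For n = 32 this gives 44,410,452 and 4,247,705,401, the two values quoted above.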


This is as far as we are going with this problem. 


Chapter 7: Rearranging Bits and Bytes 


1. An ordinary integer can be incremented by
complementing a certain number of consecutive
low-order bits. For example, to add 1 to 0x321F, it suffices
to apply the exclusive or operation to it with the mask
0x003F. Similarly, to increment a reversed integer, it
suffices to complement some high-order bits with a mask
that consists of an initial string of 1's followed by 0's.
Mobius's formula computes this mask and applies it to
the reversed integer. (The method in the text that uses
the nlz operation also does this.)

For an ordinary integer, the mask consists of 0's
followed by 1's from the rightmost 0-bit to the low-order
bit. The integer that consists of a 1-bit at the position of
the rightmost 0-bit in i is given by the expression
$\lnot i \,\&\, (i + 1)$ (see Section 2-1). To increment an ordinary integer x,
we would compute a mask by right-propagating the 1-bit
in this integer, and then exclusive or the result to x. To
increment a reversed integer, we need to compute the
reflection, or bit reversal, of that mask. The one-bit
(power of 2) quantity $\lnot i \,\&\, (i + 1)$ can be reflected by
dividing it into m/2. (This step is the key to this
algorithm.) For example, in the case of 4-bit integers,
m/2 = 8, and 8/1 = 8, 8/2 = 4, 8/4 = 2, and 8/8 = 1.
To compute the mask, it is necessary only to
left-propagate the 1-bit of the quotient, which is done by
subtracting the quotient from m. Finally, the mask is
exclusive-or'ed to the reversed integer, which produces the
next reversed integer.

As an example, suppose the integers are eight bits in
length, so that m = 256. Let i = 19 (binary 00010011),
so that revi = binary 11001000. Then $\lnot i \,\&\, (i + 1)$ =
binary 00000100 (decimal 4). Dividing this into m/2
gives a quotient of 32 (binary 00100000). Subtracting
this from m gives binary 11100000. Finally, exclusive
or'ing this mask to revi gives binary 00101000, which is
the reversed integer for decimal 20.
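The steps of this example translate directly into C. The sketch below uses a made-up function name and assumes that m is a power of 2 no greater than $2^{31}$ and that i < m − 1, so that the rightmost 0-bit of i lies inside the field.

```c
#include <stdint.h>

/* Given i and its bit reversal revi within a field of log2(m) bits,
   returns the bit reversal of i + 1. */
uint32_t next_reversed(uint32_t revi, uint32_t i, uint32_t m) {
    uint32_t bit = ~i & (i + 1);   /* 1-bit at the rightmost 0-bit of i */
    uint32_t q = (m / 2) / bit;    /* reflection of that bit in the field */
    uint32_t mask = m - q;         /* 1's from the bit of q to the top of the field */
    return revi ^ mask;
}
```

With the numbers of the example, `next_reversed(0xC8, 19, 256)` gives 0x28, which is binary 00101000.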


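2. Notice that

$$\begin{aligned}
m_0 &= \text{0x2} \cdot \text{0x11111111}, \\
m_1 &= \text{0xC} \cdot \text{0x01010101}, \\
m_2 &= \text{0xF0} \cdot \text{0x00010001}, \text{ and} \\
m_3 &= \text{0xFF00} \cdot \text{0x00000001}.
\end{aligned}$$

Also, notice that

$$\begin{aligned}
\text{0x11111111} &= \lfloor 2^{32}/15 \rfloor, \\
\text{0x01010101} &= \lfloor 2^{32}/255 \rfloor, \\
\text{0x00010001} &= \lfloor 2^{32}/(2^{16} - 1) \rfloor, \text{ and} \\
\text{0x00000001} &= \lfloor 2^{32}/(2^{32} - 1) \rfloor.
\end{aligned}$$

Thus, we have the formulas

$$\begin{aligned}
m_0 &= (2^1 - 1)\,2^1 \lfloor 2^{32}/(2^4 - 1) \rfloor, \\
m_1 &= (2^2 - 1)\,2^2 \lfloor 2^{32}/(2^8 - 1) \rfloor, \\
m_2 &= (2^4 - 1)\,2^4 \lfloor 2^{32}/(2^{16} - 1) \rfloor, \text{ and} \\
m_3 &= (2^8 - 1)\,2^8 \lfloor 2^{32}/(2^{32} - 1) \rfloor.
\end{aligned}$$

In general,

$$m_k = (2^{2^k} - 1)\,2^{2^k} \lfloor 2^W/(2^{2^{k+2}} - 1) \rfloor,$$

where W is the length of the word being shuffled, which
must be a power of 2.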
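For W = 32 the general formula is easy to evaluate with 64-bit arithmetic. The sketch below is an illustration under that assumption; the function name is made up here, and k is limited to 0 through 3.

```c
#include <stdint.h>

/* m_k = (2^(2^k) - 1) * 2^(2^k) * floor(2^32 / (2^(2^(k+2)) - 1)),
   for W = 32 and 0 <= k <= 3. */
uint32_t shuffle_mask(int k) {
    unsigned e = 1u << k;                                   /* 2^k */
    uint64_t field = ((uint64_t)1 << e) - 1;                /* 2^(2^k) - 1 */
    uint64_t divisor = ((uint64_t)1 << (4 * e)) - 1;        /* 2^(2^(k+2)) - 1 */
    uint64_t replicate = ((uint64_t)1 << 32) / divisor;     /* floor(2^32 / divisor) */
    return (uint32_t)((field << e) * replicate);
}
```

The four results are 0x22222222, 0x0C0C0C0C, 0x00F000F0, and 0x0000FF00, which agree with the products listed at the start of this answer.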


3. It is necessary only to change the two lines

   s = s + b;
   x = x >> 1;

to

   s = s + 1;
   x = x >> b;


4. Any true LRU algorithm must record the complete order
of references to the n cache lines in a set. Since there are
n! orderings of n things, any implementation of LRU must
use at least $\lceil \log_2 n! \rceil$ memory bits. The table below
compares this to the number of bits required by the
reference matrix method.

Degree of Associativity   Theoretical Minimum   Reference Matrix Method


Chapter 8: Multiplication 


1. As shown in Section 8-3, if x and y are the multiplication
operands interpreted as signed integers, then their
product interpreted as unsigned integers is

$$(x + 2^{32}x_{31})(y + 2^{32}y_{31}) = xy + 2^{32}(x_{31}y + y_{31}x) + 2^{64}x_{31}y_{31},$$

where $x_{31}$ and $y_{31}$ are the sign bits of x and y,
respectively, as integers 0 or 1. Because the product
differs from xy by a multiple of $2^{32}$, the low-order 32 bits
of the product are the same.


2. Method 1: Chances are the machine has a multiplication
instruction that gives the low-order 32 bits of the product
of two 32-bit integers. That is,

   low = u*v;

Method 2: Just before the return statement, insert

   low = (w1 << 16) + (w0 & 0xFFFF);

Method 3: Save the products u1*v0 and u0*v1 in
temporaries t1 and t2. Then

   low = ((t1 + t2) << 16) + w0;

Methods 2 and 3 are three basic RISC instructions
each, and they work for both mulhs and its unsigned
counterpart (and may be faster than method 1).


3. Partition the 32-bit operands u and v into 16-bit unsigned
components a, b, c, and d, so that

$$u = 2^{16}a + b \quad\text{and}\quad v = 2^{16}c + d,$$

where $0 \le a, b, c, d \le 2^{16} - 1$. Let

$$\begin{aligned}
p &= ac, \\
q &= bd, \text{ and} \\
r &= (-a + b)(c - d).
\end{aligned}$$

Then $uv = 2^{32}p + 2^{16}(r + p + q) + q$, which is easily
verified.

Now $0 \le p, q \le 2^{32} - 2^{17} + 1$, so that p and q can be
represented by 32-bit unsigned integers. However, it is
easily calculated that

$$-2^{32} + 2^{17} - 1 \le r \le 2^{32} - 2^{17} + 1,$$

so that r is a signed 33-bit quantity. It will be convenient
to represent it by a signed 64-bit integer, with the
high-order 32 bits being either all 0's or all 1's. The machine's
multiply instruction will compute the low-order 32 bits of
r, and the high-order 32 bits can be ascertained from the
values of −a + b and c − d. These are 17-bit signed
integers. If they have opposite signs and are nonzero,
then r is negative and hence its high-order 32 bits are all
1's. If they have the same signs or either is 0, then r is
nonnegative and hence its high-order 32 bits are all 0's.
The test that either −a + b or c − d is 0 can be done by
testing only the low-order 32 bits of r. If they are 0, then
one of the factors must be 0, because $|r| < 2^{32}$.

These considerations lead to the following function for
computing the high-order 32 bits of the product of u and
v.


   unsigned mulhu(unsigned u, unsigned v) {
      unsigned a, b, c, d, p, q, rlow, rhigh;

      a = u >> 16; b = u & 0xFFFF;
      c = v >> 16; d = v & 0xFFFF;

      p = a*c;
      q = b*d;
      rlow = (-a + b)*(c - d);
      rhigh = (int)((-a + b)^(c - d)) >> 31;
      if (rlow == 0) rhigh = 0;    // Correction.

      q = q + (q >> 16);  // Overflow cannot occur here.
      rlow = rlow + p;
      if (rlow < p) rhigh = rhigh + 1;
      rlow = rlow + q;
      if (rlow < q) rhigh = rhigh + 1;

      return p + (rlow >> 16) + (rhigh << 16);
   }


After computing p, q, rlow, and rhigh, the function
does the following addition:

The statement "if (rlow < p) rhigh = rhigh + 1" is
adding 1 to rhigh if there is a carry from the addition of
p to rlow in the previous statement.

The low-order 32 bits of the product can be obtained
from the following expression, inserted just after the
"correction" step above:

   q + ((p + q + rlow) << 16)

A branch-free version follows.


   unsigned mulhu(unsigned u, unsigned v) {
      unsigned a, b, c, d, p, q, x, y, rlow, rhigh, t;

      a = u >> 16; b = u & 0xFFFF;
      c = v >> 16; d = v & 0xFFFF;

      p = a*c;
      q = b*d;
      x = -a + b;
      y = c - d;
      rlow = x*y;
      rhigh = (x ^ y) & (rlow | -rlow);
      rhigh = (int)rhigh >> 31;

      q = q + (q >> 16);  // Overflow cannot occur here.
      t = (rlow & 0xFFFF) + (p & 0xFFFF) + (q & 0xFFFF);
      p += (t >> 16) + (rlow >> 16) + (p >> 16) + (q >> 16);
      p += (rhigh << 16);
      return p;
   }


These functions have more overhead than the
four-multiplication function of Figure 8-2 on page 174, and
will be superior only if the machine's multiply instruction
is slower than that found on most modern computers. In
"bignum" arithmetic (arithmetic on multiword integers),
the time to multiply is substantially more than the time to
add two integers of similar sizes. For that application, a
method known as Karatsuba multiplication [Karat]
applies the three-multiplication scheme recursively, and it
is faster than the straightforward four-multiplication
scheme for sufficiently large numbers. Actually, Karatsuba
multiplication, as usually described, uses

$$\begin{aligned}
p &= ac, \\
q &= bd, \\
r &= (a + b)(c + d), \text{ and} \\
uv &= 2^{32}p + 2^{16}(r - p - q) + q.
\end{aligned}$$

For our application, that method does not work out very
well because r can be nearly as large as $2^{34}$, and there
does not seem to be any easy way to calculate the
high-order two bits of the 34-bit quantity r.

A signed version of the functions above has problems
with overflow. It is just as well to use the unsigned
function and correct it as described in Section 8-3 on
page 174.


Chapter 9: Integer Division 


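1. Let $x = x_0 + \delta$, where $x_0$ is an integer and $0 \le \delta < 1$.
Then $\lceil -x \rceil = \lceil -x_0 - \delta \rceil = -x_0$, by the definition of the
ceiling function as the next integer greater than or equal
to its argument. Hence $-\lceil -x \rceil = x_0$, which is $\lfloor x \rfloor$.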


2. Let n / d denote the quotient of signed, truncating, integer
division. Then we must compute

   $\begin{cases} n/d, & \text{if } n \ge 0,\ d > 0, \\ n/d - 1, & \text{if } n < 0,\ d > 0, \\ n/d, & \text{if } n \ge 0,\ d < 0, \\ n/d + 1, & \text{if } n < 0,\ d < 0. \end{cases}$

(If d = 0 the result is immaterial.) This can be computed
as n / d + c, where

   $c = ((n \overset{s}{\gg} 31) \oplus (d \overset{s}{\gg} 31)) - (d \overset{s}{\gg} 31),$

which is four instructions to compute c (the term $d \overset{s}{\gg} 31$
commons). Another way to compute c in four
instructions, but with the shifts unsigned, is

   $c = (d \overset{u}{\gg} 31) - ((n \oplus d) \overset{u}{\gg} 31).$

If your machine has mod-32 shifts, c can be computed in
three instructions:

   $c = (n \overset{s}{\gg} 31) \overset{u}{\gg} (d \overset{s}{\gg} 31).$

For the remainder, let rem(n, d) denote the remainder
upon dividing the signed integer n by the signed integer d,
using truncating division. Then we must compute

   $\begin{cases} \mathrm{rem}(n, d), & \text{if } n \ge 0,\ d > 0, \\ \mathrm{rem}(n, d) + d, & \text{if } n < 0,\ d > 0, \\ \mathrm{rem}(n, d), & \text{if } n \ge 0,\ d < 0, \\ \mathrm{rem}(n, d) - d, & \text{if } n < 0,\ d < 0. \end{cases}$

The amount to add to rem(n, d) is 0 or the absolute value
of d. This can be computed from

   $|d| = (d \oplus (d \overset{s}{\gg} 31)) - (d \overset{s}{\gg} 31),$
   $c = |d| \mathbin{\&} (n \overset{s}{\gg} 31),$

which is five instructions to compute c. It can be
computed in four instructions if your machine has
mod-32 shifts and you use the multiply instruction
(details omitted).

3. To get the quotient of floor division, it is necessary only to
subtract 1 from the quotient of truncating division if the
dividend and divisor have opposite signs:

   $n/d - ((n \oplus d) \overset{u}{\gg} 31).$

For the remainder, it is necessary only to add the
divisor to the remainder of truncating division if the
dividend and divisor have opposite signs:

   $\mathrm{rem}(n, d) + (((n \oplus d) \overset{s}{\gg} 31) \mathbin{\&} d).$


4. The usual method, most likely, is to compute
$\lfloor (n + d - 1)/d \rfloor$. The problem is that n + d - 1 can overflow.
(Consider computing $\lceil 12/5 \rceil$ on a 4-bit machine.)

Another standard method is to compute $q = \lfloor n/d \rfloor$
using the machine’s divide instruction, then compute the
remainder as r = n - qd, and if r is nonzero, add 1 to q.
(Alternatively, add 1 if $n \ne qd$.) This gives the correct
result for all n and $d \ne 0$, but it is somewhat expensive
because of the multiply, subtract, and conditional add of
1. On the other hand, if your machine's divide instruction
gives the remainder as a by-product, and especially if it
has an efficient way to do the computation $q = q + (r \ne 0)$,
then this is a good way to do it.

Still another way is to compute $q = \lfloor (n - 1)/d \rfloor + 1$.
Unfortunately, this fails for n = 0. It can be fixed if the
machine has a simple way to compute the $x \ne 0$ predicate,
such as by means of a compare instruction that sets the
target GPR to the integer 1 or 0 (see also Section 2-12 on
page 23). Then one can compute:

   $c \leftarrow (n \ne 0)$
   $q \leftarrow \lfloor (n - c)/d \rfloor + c$

Lastly, one can compute $q \leftarrow \lfloor (n - 1)/d \rfloor + 1$ and
then change the result to 0 if n = 0, by means of a
conditional move or select instruction, for example.
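
The alternatives for unsigned ceiling division can be compared side by side in C. This is a minimal sketch under stated assumptions: the function names are invented here, uint32_t models the machine word, and C's comparison operators (which yield 0 or 1) stand in for the compare instruction mentioned in the text.

```c
#include <assert.h>
#include <stdint.h>

/* ceil(n/d) by divide, multiply, subtract, and conditional add of 1. */
uint32_t ceil_div_rem(uint32_t n, uint32_t d) {
    assert(d != 0);
    uint32_t q = n / d;
    uint32_t r = n - q * d;
    return q + (r != 0);
}

/* ceil(n/d) as floor((n - c)/d) + c, with c = (n != 0).
   The sum n + d - 1 is never formed, so nothing can overflow. */
uint32_t ceil_div_pred(uint32_t n, uint32_t d) {
    assert(d != 0);
    uint32_t c = (n != 0);
    return (n - c) / d + c;
}

/* The "usual" form, kept only to exhibit the overflow of n + d - 1. */
uint32_t ceil_div_naive(uint32_t n, uint32_t d) {
    assert(d != 0);
    return (n + d - 1) / d;
}
```

With n = 0xFFFFFFFF and d = 2, the naive form wraps to 0, while the other two return 0x80000000.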


5. Let $f(\lfloor x \rfloor) = a$ and $f(x) = b$, as illustrated below.

[Figure: $\lfloor x \rfloor$ and $x$ marked on the horizontal axis.]

If b is an integer, then by property (c), x is also, so that
$\lfloor x \rfloor = x$, and there is nothing to prove. Therefore, assume
in what follows that b is not an integer, but a may or may
not be.

There cannot be an integer k such that a < k < b,
because if there were, there would be an integer between
$\lfloor x \rfloor$ and x (by properties (a), (b), and (c)), which is
impossible. Therefore $\lfloor a \rfloor = \lfloor b \rfloor$; that is,
$\lfloor f(\lfloor x \rfloor) \rfloor = \lfloor f(x) \rfloor$.

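As examples of the utility of this, we have, for a and b
integers,

   $\left\lfloor \frac{\lfloor x \rfloor + a}{b} \right\rfloor = \left\lfloor \frac{x + a}{b} \right\rfloor,$
   $\lfloor \sqrt{\lfloor x \rfloor} \rfloor = \lfloor \sqrt{x} \rfloor,$
   $\lfloor \log_2 \lfloor x \rfloor \rfloor = \lfloor \log_2 x \rfloor.$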


It can similarly be shown that if f (x) has properties (a),
(b), and (c), then

   $\lceil f(\lceil x \rceil) \rceil = \lceil f(x) \rceil.$

Chapter 10: Integer Division by Constants 


1. (a) If the divisor is even, then the low-order bit of the
dividend does not affect the quotient (of floor division); if
it is 1 it makes the remainder odd. After turning this bit
off, the remainder of the division will be an even number.
Hence for an even divisor d, the remainder is at most d - 2.
This slight change in the maximum possible remainder
results in the maximum multiplier m being a W-bit
number rather than a (W + 1)-bit number (and hence the
shrxi instruction is not needed), as we will now see. In
fact, we will investigate what simplifications occur if the
divisor ends in z 0-bits, that is, if it is a multiple of $2^z$, for
$z \ge 0$. In this case, the z low-order bits of the dividend
can be cleared without affecting the quotient, and after
clearing those bits, the maximum remainder is $d - 2^z$.


Following the derivation of Section 10-9 on page 230,
but changed so that the maximum remainder is $d - 2^z$, we
have $n_c = 2^W - \mathrm{rem}(2^W, d) - 2^z$, and inequality (24a)
becomes


   $2^W - d \le n_c \le 2^W - 2^z.$


Inequality (25) becomes

   $\frac{2^p}{d} \le m < \frac{2^p}{d} \cdot \frac{n_c + 2^z}{n_c}.$


Equation (26) is unchanged, and inequality (27) becomes


   $2^p > \frac{n_c}{2^z}\,(d - 1 - \mathrm{rem}(2^p - 1, d)). \qquad (27')$


Inequality (28) becomes

   $1 \le 2^p \le \frac{2 n_c (d - 1)}{2^z} + 1.$


In the case that p is not forced to equal W, combining
these inequalities gives


   $\frac{1}{d} \le m < \frac{2 n_c (d - 1) + 2^z}{2^z d} \cdot \frac{n_c + 2^z}{n_c},$

   $1 \le m < \frac{2d - 2 + 2^z / n_c}{2^z d}\,(n_c + 2^z),$

   $1 \le m < \frac{2}{2^z}\,(n_c + 2^z) \le 2^{W - z + 1}.$

Thus if $z \ge 1$, $m < 2^W$, so that m fits in a W-bit word.
The same result follows in the case that p is forced to
equal W.

To calculate the multiplier for a given divisor,
calculate $n_c$ as shown above, then find the smallest $p \ge W$
that satisfies (27′), and calculate m from (26). As an
example, for d = 14 and W = 32, we have $n_c = 2^{32} - \mathrm{rem}(2^{32}, 14) - 2$
= 0xFFFFFFFA. Repeated use of (27′)
gives p = 35, from which (26) gives $m = (2^{35} + 14 - 1 - 3)/14$
= 0x92492493. Thus, the code to divide by 14 is


ins   n,R0,0,1        Clear low-order bit of n.
li    M,0x92492493    Load magic number.
mulhu q,M,n           q = floor(M*n/2**32).
shri  q,q,3           q = q/8.


(b) Again, if the divisor is a multiple of 2z, then the low- 
order z bits of the dividend do not affect the quotient. 
Therefore, we can clear the low-order z bits of the 
dividend, and divide the divisor by 2z, without changing 
the quotient. (The division of the divisor would be done 
at compile time.) 


Using the revised n and d, both less than $2^{W-z}$, (24a)
becomes


   $2^{W-z} - d \le n_c \le 2^{W-z} - 1.$


Equation (26) and inequality (27) are not changed, but
they are to be used with the revised values of $n_c$ and d.
We omit the proof that the multiplier will be less than $2^W$
and give an example again for d = 14 and W = 32. In the
equations, we use d = 7. Thus, we have $n_c = 2^{31} - \mathrm{rem}(2^{31}, 7) - 1$
= 0x7FFFFFFD. Repeated use of (27)
gives p = 34, from which (26) gives $m = (2^{34} + 5)/7$
= 0x92492493, and the code to divide by 14 is


shri  n,n,1           Halve the dividend.
li    M,0x92492493    Load magic number.
mulhu q,M,n           q = floor(M*n/2**32).
shri  q,q,2           q = q/4.
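
Both instruction sequences for dividing by 14 can be modeled in C and compared with the compiler's own division. The following is a sketch only: the function names are made up for this illustration, and a 64-bit multiply stands in for the mulhu instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the mulhu instruction: high 32 bits of a 32x32 product. */
static uint32_t mulhu(uint32_t m, uint32_t n) {
    return (uint32_t)(((uint64_t)m * n) >> 32);
}

/* Method (a): clear the low-order bit, multiply, shift right 3 (p = 35). */
uint32_t div14_a(uint32_t n) {
    n &= ~1u;
    return mulhu(0x92492493u, n) >> 3;
}

/* Method (b): halve the dividend, multiply, shift right 2 (p = 34). */
uint32_t div14_b(uint32_t n) {
    n >>= 1;
    return mulhu(0x92492493u, n) >> 2;
}

/* Spot-check both methods on small, strided, and largest dividends. */
void div14_selfcheck(void) {
    for (uint64_t n = 0; n < 1000; n++) {
        assert(div14_a((uint32_t)n) == (uint32_t)n / 14);
        assert(div14_b((uint32_t)n) == (uint32_t)n / 14);
    }
    for (uint64_t n = 0; n <= 0xFFFFFFFFull; n += 65521) {
        assert(div14_a((uint32_t)n) == (uint32_t)n / 14);
        assert(div14_b((uint32_t)n) == (uint32_t)n / 14);
    }
    for (uint64_t n = 0xFFFFFF00ull; n <= 0xFFFFFFFFull; n++) {
        assert(div14_a((uint32_t)n) == (uint32_t)n / 14);
        assert(div14_b((uint32_t)n) == (uint32_t)n / 14);
    }
}
```

The same multiplier serves both methods here; only the pre-step on the dividend and the final shift amount differ.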


These methods should not always be used when the 
divisor is an even number. For example, to divide by 10, 
12, 18, or 22 it is better to use the method described in 
the text, because there's no need for an instruction to 
clear the low-order bits of the dividend, or to shift the 
dividend right. Instead, the algorithm of Figure 10-3 on 
page 236 should be used, and if it gives an “add” 
indicator of 1 and the divisor is even, then one of the 
above techniques can be used to get better code on most 
machines. Among the divisors less than or equal to 100, 
these techniques are useful for 14, 28, 38, 42, 54, 56, 62, 
70, 74, 76, 78, 84, and 90. 

Which is better, (a) or (b)? Experimentation indicates 
that method (b) is preferable in terms of the number of 
instructions required, because it seems to always require 
either the same number of instructions as (a), or one 


fewer. However, there are cases in which (a) and (b) 
require the same number of instructions, but (a) yields a 
smaller multiplier. Some representative cases are shown 
below. The “Book” method is the code that Figure 10-3 
gives. We assume here that the computer’s and immediate 
instruction sign-propagates the high-order bit of the 
immediate field (our basic RISC would use the insert 
instruction). 


Book                 (a)                  (b)


li    M,0xaaaaaaab   andi  n,n,-2         shri  n,n,1
mulhu q,M,n          li    M,0x2aaaaaab   li    M,0x55555556
shri  q,q,2          mulhu q,M,n          mulhu q,M,n


li    M,0x24924925   andi  n,n,-4         shri  n,n,2
mulhu q,M,n          li    M,0x24924925   li    M,0x24924925
add   q,q,n          mulhu q,M,n          mulhu q,M,n
shrxi q,q,5          shri  q,q,2


These techniques are not useful for signed division. In 
that case, the difference between the best and worst code 
is only two instructions (as illustrated by the code for 
dividing by 3 and by 7, shown in Section 10-3 on page 
207). The fix-up code for method (a) would require 
adding 1 to the dividend if it is negative and odd, and 
subtracting 1 if the dividend is nonnegative and odd, 
which would require more than two instructions. For 
method (b), the fix-up code is to divide the dividend by 2, 
which requires three basic RISC instructions (see Section 
10-1 on page 205), so this method is also not a winner. 


2. Python code is shown below. 


from math import log
import sys

def magicg(nmax, d):
   nc = (nmax//d)*d - 1
   nbits = int(log(nmax, 2)) + 1
   for p in range(0, 2*nbits - 1):
      if 2**p > nc*(d - (2**p)%d):
         m = (2**p + d - (2**p)%d)//d
         return (m, p)
   print("Can't find p, something is wrong.")
   sys.exit(1)


3. Because $81 = 3^4$, we need for the starting value, the
multiplicative inverse of d modulo 3. This is simply the
remainder of dividing d by 3, because $1 \cdot 1 \equiv 1 \pmod 3$
and $2 \cdot 2 \equiv 1 \pmod 3$ (and if the remainder is 0, there is
no multiplicative inverse). For d = 146, the calculation
proceeds as follows.


   $x_0 = 146 \bmod 3 = 2,$
   $x_1 = 2(2 - 146 \cdot 2) = -580 \equiv 68 \pmod{81},$
   $x_2 = 68(2 - 146 \cdot 68) = -674{,}968 \equiv 5 \pmod{81},$
   $x_3 = 5(2 - 146 \cdot 5) = -3640 \equiv 5 \pmod{81}.$


A fixed point was reached, so the multiplicative inverse of
146 modulo 81 is 5. Check: $146 \cdot 5 = 730 \equiv 1 \pmod{81}$.
Actually, it is known a priori that two iterations suffice.
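
The iteration is easy to try out in C. The sketch below assumes a positive d that is not a multiple of 3; the function name inverse_mod81 is chosen for this illustration, and d is reduced modulo 81 first so the intermediate products stay small.

```c
#include <assert.h>

/* Multiplicative inverse of d modulo 81 (= 3^4) by Newton's iteration
   x <- x(2 - d*x), started from the inverse modulo 3.
   Assumes d > 0 and d not divisible by 3. */
int inverse_mod81(int d) {
    assert(d > 0 && d % 3 != 0);
    int dm = d % 81;            /* keep intermediate products small */
    int x = d % 3;              /* 1*1 = 1 and 2*2 = 1 (mod 3) */
    for (int i = 0; i < 2; i++) {   /* 3 -> 9 -> 81: two steps suffice */
        x = (x * (2 - dm * x)) % 81;
        if (x < 0)
            x += 81;            /* C's % keeps the sign of the dividend */
    }
    return x;
}
```

Each step doubles the exponent of 3 for which the inverse is valid, which is why two iterations are enough to reach modulus 81.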


Chapter 11: Some Elementary Functions 


1. Yes. The result is correct in spite of the double truncation.


Suppose $\lfloor \sqrt{x} \rfloor = a$. Then by the definition of this
operation, a is an integer such that $a^2 \le x$ and $(a + 1)^2 > x$.

Let $\lfloor \sqrt{a} \rfloor = b$. Then $b^2 \le a$ and $(b + 1)^2 > a$. Thus,
$b^4 \le a^2$ and, because $a^2 \le x$, $b^4 \le x$.


Because $(b + 1)^2 > a$, $(b + 1)^2 \ge a + 1$, so that
$(b + 1)^4 \ge (a + 1)^2$. Because $(a + 1)^2 > x$, $(b + 1)^4 > x$. Hence b is
the integer fourth root of x.


This follows more easily from exercise 5 of Chapter 9. 
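2. Straightforward code is shown below.

int icbrt64(unsigned long long x) {
   int s;
   unsigned long long y, b, bs;

   y = 0;
   for (s = 63; s >= 0; s = s - 3) {
      y = 2*y;
      b = 3*y*(y + 1) + 1;
      bs = b << s;
      if (x >= bs && b == (bs >> s)) {
         x = x - bs;
         y = y + 1;
      }
   }
   return y;
}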


Overflow of b (bs in the above code) can occur only
on the second loop iteration. Therefore, another way to
deal with the overflow is to expand the first two iterations
of the loop, and then execute the loop only from s = 57
on down, with the phrase “&& b == (bs >> s)” deleted.


By inspection, the effect of the first two loop iterations
is:


If $x \ge 2^{63}$, set $x = x - 2^{63}$ and set y = 2.
If $2^{60} \le x < 2^{63}$, set $x = x - 2^{60}$ and set y = 1.
If $x < 2^{60}$, set y = 0 (and don't change x).


Therefore, the beginning of the routine can be coded as 


shown below. 


y = 0;
if (x >= 0x1000000000000000LL) {
   if (x >= 0x8000000000000000LL) {
      x = x - 0x8000000000000000LL;
      y = 2;
   } else {
      x = x - 0x1000000000000000LL;
      y = 1;
   }
}
for (s = 57; s >= 0; s = s - 3) {


And, as mentioned, the phrase “&& b == (bs >> s)” can
be deleted.
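
To see that expanding the first two iterations gives the same results, the two variants can be run against each other. This is a hedged sketch: the loop body is an assumed shape, modeled on the description in the text (s stepping by 3, b = 3y(y + 1) + 1, bs = b << s), and the names icbrt64_loop, icbrt64_unrolled, and icbrt64_selfcheck are invented for the comparison.

```c
#include <assert.h>

/* Loop form with the overflow test (assumed shape, see lead-in). */
int icbrt64_loop(unsigned long long x) {
    unsigned long long y = 0, b, bs;
    for (int s = 63; s >= 0; s = s - 3) {
        y = 2*y;
        b = 3*y*(y + 1) + 1;
        bs = b << s;
        if (x >= bs && b == (bs >> s)) {   /* second test rejects overflow */
            x = x - bs;
            y = y + 1;
        }
    }
    return (int)y;
}

/* First two iterations expanded; the overflow test is then dropped. */
int icbrt64_unrolled(unsigned long long x) {
    unsigned long long y = 0, b, bs;
    if (x >= 0x1000000000000000ULL) {
        if (x >= 0x8000000000000000ULL) {
            x = x - 0x8000000000000000ULL;
            y = 2;
        } else {
            x = x - 0x1000000000000000ULL;
            y = 1;
        }
    }
    for (int s = 57; s >= 0; s = s - 3) {
        y = 2*y;
        b = 3*y*(y + 1) + 1;
        bs = b << s;
        if (x >= bs) {
            x = x - bs;
            y = y + 1;
        }
    }
    return (int)y;
}

/* Compare the two variants on small values and on powers of 2. */
void icbrt64_selfcheck(void) {
    for (unsigned long long x = 0; x < 5000; x++)
        assert(icbrt64_loop(x) == icbrt64_unrolled(x));
    for (int k = 0; k < 64; k++) {
        unsigned long long x = 1ULL << k;
        assert(icbrt64_loop(x) == icbrt64_unrolled(x));
        assert(icbrt64_loop(x - 1) == icbrt64_unrolled(x - 1));
    }
}
```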


3. Six [Knu2]. The binary decomposition method, based on
$x^{23} = x^{16} \cdot x^4 \cdot x^2 \cdot x$, takes seven. Factoring $x^{23}$ as
$(x^{11})^2 \cdot x$ or as $((x^5)^2 \cdot x)^2 \cdot x$ also takes seven. But computing
powers of x in the order $x^2, x^3, x^5, x^{10}, x^{13}, x^{23}$, in which
each term is a product of two previous terms or of x, does
it in six multiplications.
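
Written out in C, the six-multiplication chain and the seven-multiplication binary method look as follows. This is an illustrative sketch; the function names are chosen here, and the arithmetic is modulo $2^{64}$, so large results wrap.

```c
#include <assert.h>
#include <stdint.h>

/* x^23 by the chain x^2, x^3, x^5, x^10, x^13, x^23: six multiplications. */
uint64_t pow23_chain(uint64_t x) {
    uint64_t x2  = x * x;       /* 1 */
    uint64_t x3  = x2 * x;      /* 2 */
    uint64_t x5  = x3 * x2;     /* 3 */
    uint64_t x10 = x5 * x5;     /* 4 */
    uint64_t x13 = x10 * x3;    /* 5 */
    return x13 * x10;           /* 6 */
}

/* x^23 = x^16 * x^4 * x^2 * x by binary decomposition: seven multiplications. */
uint64_t pow23_binary(uint64_t x) {
    uint64_t x2  = x * x;       /* 1 */
    uint64_t x4  = x2 * x2;     /* 2 */
    uint64_t x8  = x4 * x4;     /* 3 */
    uint64_t x16 = x8 * x8;     /* 4 */
    return x16 * x4 * x2 * x;   /* 5, 6, 7 */
}

/* The two methods must agree modulo 2^64 for every input. */
void pow23_selfcheck(void) {
    for (uint64_t x = 0; x < 1000; x++)
        assert(pow23_chain(x) == pow23_binary(x));
}
```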


4. (a) x rounded down to an integral power of 2. (b) x 
rounded up to an integral power of 2 (in both cases, x 
itself if x is an integral power of 2). 


Chapter 12: Unusual Bases for Number Systems 


1. If B is a binary number and N is its base -2 equivalent,
then


   $B \leftarrow \mathtt{0x55555555} - (N \oplus \mathtt{0x55555555})$, and
   $N \leftarrow (\mathtt{0x55555555} - B) \oplus \mathtt{0x55555555}.$


2. An easy way to do this is to convert the base -2 number x
to binary, add 1, and convert back to base -2. Using
Schroeppel's formula and simplifying, the result is


   $((x \oplus \mathtt{0xAAAAAAAA}) + 1) \oplus \mathtt{0xAAAAAAAA}$, or
   $((x \oplus \mathtt{0x55555555}) - 1) \oplus \mathtt{0x55555555}.$
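
These conversion and increment formulas can be checked against a digit-by-digit evaluation of the base -2 number. The sketch below is an illustration under stated assumptions: all names are invented here, values are kept modulo $2^{32}$ in uint32_t, and the binary-to-base -2 direction uses Schroeppel's formula in its 0xAAAAAAAA form.

```c
#include <assert.h>
#include <stdint.h>

/* Value of the base -2 number n, modulo 2^32, by summing digit weights. */
uint32_t negabinary_value(uint32_t n) {
    uint32_t v = 0, w = 1;
    for (int i = 0; i < 32; i++) {
        if ((n >> i) & 1)
            v += w;
        w *= 0xFFFFFFFEu;       /* multiply the weight by -2 (mod 2^32) */
    }
    return v;
}

/* Base -2 to binary, 0x55555555 form. */
uint32_t negabinary_to_binary(uint32_t n) {
    return 0x55555555u - (n ^ 0x55555555u);
}

/* Binary to base -2, Schroeppel's formula. */
uint32_t binary_to_negabinary(uint32_t b) {
    return (b + 0xAAAAAAAAu) ^ 0xAAAAAAAAu;
}

/* Add 1 to a base -2 number without leaving base -2. */
uint32_t negabinary_increment(uint32_t x) {
    return ((x ^ 0xAAAAAAAAu) + 1) ^ 0xAAAAAAAAu;
}

/* The same increment, 0x55555555 form. */
uint32_t negabinary_increment_alt(uint32_t x) {
    return ((x ^ 0x55555555u) - 1) ^ 0x55555555u;
}

/* Check conversions and increments against the digit-weight evaluation. */
void negabinary_selfcheck(void) {
    for (uint32_t n = 0; n < 4096; n++) {
        uint32_t b = negabinary_to_binary(n);
        assert(b == negabinary_value(n));
        assert(binary_to_negabinary(b) == n);
        assert(negabinary_value(negabinary_increment(n)) == b + 1);
        assert(negabinary_increment_alt(n) == negabinary_increment(n));
    }
}
```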


3. As in exercise 1, one could convert the base -2 number x
to binary, AND it with 0xFFFFFFF0, and convert back to base
-2. This would be five operations. However, it can be 
done in four operations with either of the formulas 
below.2 


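   $(((x \oplus \mathtt{0xAAAAAAAA}) - 10) \oplus \mathtt{0xAAAAAAAA}) \mathbin{\&} -16$
   $(((x \oplus \mathtt{0x55555555}) + 10) \oplus \mathtt{0x55555555}) \mathbin{\&} -16$


The formulas below round a number up to the next
greater multiple of 16.


   $(((x \oplus \mathtt{0xAAAAAAAA}) + 5) \oplus \mathtt{0xAAAAAAAA}) \mathbin{\&} -16$
   $(((x \oplus \mathtt{0x55555555}) - 5) \oplus \mathtt{0x55555555}) \mathbin{\&} -16$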


There are similar formulas for rounding up or down to 
other powers of 2. 


4. This is very easy to program in Python, because that 
language supports complex numbers. 


import sys
import cmath


num = sys.argv[1:]
if len(num) == 0:
   print("Converts a base -1 + 1j number, given in decimal")
   print("or hex, to the form a + bj, with a, b real.")
   sys.exit()
num = eval(num[0])
r = 0
weight = 1
while num > 0:
   if num & 1:
      r = r + weight
   weight = (-1 + 1j)*weight
   num = num >> 1


print('r =', r)


5. To convert a base — 1 + i number to its negative, either 
subtract it from 0 or multiply it by -1 (11101), using the 
rules for base - 1 + i arithmetic. 


To extract the real part of a number x, add in the 
negative of its imaginary part. Process the bits of x in 
groups of four, starting at the right (low-order) end. 
Number the bits in each group 0, 1, 2, and 3, from the 
right. Then: 


If bit 1 is on, add - i (0111) at the current group's 
position. 

If bit 2 is on, add 2 i (1110100) at the current group's 
position. 

If bit 3 is on, add -2 i (0100) at the current group's 
position. 

Bit 1 has a weight of -1 + i, so adding in -i cancels
its imaginary component. A similar remark applies to bits
2 and 3. There is no need to do anything for bit 0,
because that has no imaginary component. Each group of
four bits has a weight of -4 times the weight of the group
immediately to its right, because 10000 in base -1 + i is
-4 decimal. Thus, the weight of bit n of x is a real
number (-4) times the weight of bit n - 4.


The example below illustrates extracting the real part
of the base -1 + i number 101101101.


      1 0110 1101   x
         111 0100   2i added in for bit 2
             0100   -2i added in for bit 3
        0111 0000   -i(-4) added in for bit 5
    111 0100 0000   2i(-4) added in for bit 6
   --------------
   1100 1101 1101   sum


The reader may verify that x is 23 + 4i, and the sum
is 23. In working out this addition, many carries are
generated, which are not shown above. Several shortcuts
are possible: If bits 2 and 3 are both on, there is no need
to add anything in for these bits, because we would be
adding in 2i and -2i. If a group ends in 11, these bits can
be simply dropped, because they constitute a pure
imaginary (i). Similarly, bit 2 can be simply dropped, as
its weight is a pure imaginary (-2i).

Carried to its extreme, a method employing these
kinds of shortcuts would translate each group of four bits
independently to its real part. In some cases a carry is
generated, and these carries would be added to the
translated number. To illustrate, let us represent each
group of four bits in hexadecimal. The translation is
shown below.


   0 → 0     4 → 0     8 → C     C → C
   1 → 1     5 → 1     9 → D     D → D
   2 → 1D    6 → 1D    A → 1     E → 1
   3 → 0     7 → 0     B → C     F → C


The digits 2 and 6 have real part -1, which is written
1D in base -1 + i. For these digits, replace the source
digit with D and carry a 1. The carries can be added in
using the basic rules of addition in base -1 + i, but for
hand work there is a more expedient way. After
translation, there are only four possible digits: 0, 1, C, and
D, as the translation table shows. Rules for adding 1 to
these digits are shown in the left-hand column below.


   0 + 1 = 1       0 + 1D = 1D
   1 + 1 = C       1 + 1D = 0
   C + 1 = D       C + 1D = 1
   D + 1 = 1D0     D + 1D = C


Adding 1 to D generates a carry of 1D (because 3 + 1 =
4). We will carry both digits to the same column. The
right-hand column above shows how to handle the carry
of 1D. In doing the addition, it is possible to get a carry of
both 1 and 1D in the same column (the first carry from
the translation and the second from the addition). In this
case, the carries cancel each other, because 1D is -1 in
base -1 + i. It is not possible to get two carries of 1, or
two of 1D, in the same column.


The example below illustrates the use of this method
to extract the real part of the base -1 + i number EA26
(written in hexadecimal).

   EA26   x
    11    carries from the translation
   11DD   x with its hex digits translated
   ----
   110D   sum

The reader may verify that x is -45 - 21i and the sum
is -45.

Incidentally, a base -1 + i number is real iff all of its
digits, expressed in hexadecimal, are 0, 1, C, or D.
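
The worked examples in this answer can be checked by evaluating a base -1 + i number directly, one bit at a time. The following is a sketch, not the book's method: the type gauss_t and the function names are invented for this illustration, and each bit weight is obtained from the previous one through (a + bi)(-1 + i) = (-a - b) + (a - b)i.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { long re, im; } gauss_t;   /* hypothetical helper type */

/* Value of the base -1 + i number whose bits are given in n. */
gauss_t base_m1pi_value(uint32_t n) {
    gauss_t v = {0, 0};
    long wr = 1, wi = 0;                   /* weight of bit 0 is 1 */
    for (int i = 0; i < 32; i++) {
        if ((n >> i) & 1) {
            v.re += wr;
            v.im += wi;
        }
        long t = -wr - wi;                 /* next weight = weight * (-1 + i) */
        wi = wr - wi;
        wr = t;
    }
    return v;
}

/* True if every hexadecimal digit of n is 0, 1, C, or D. */
int digits_all_01CD(uint32_t n) {
    for (int i = 0; i < 8; i++) {
        unsigned d = (n >> (4 * i)) & 0xF;
        if (d != 0x0 && d != 0x1 && d != 0xC && d != 0xD)
            return 0;
    }
    return 1;
}

/* A number is real exactly when all its hex digits are 0, 1, C, or D. */
void base_m1pi_selfcheck(void) {
    for (uint32_t n = 0; n < 65536; n++)
        assert((base_m1pi_value(n).im == 0) == (digits_all_01CD(n) != 0));
}
```

The self-check covers every 16-bit pattern, which is enough to exercise all combinations of four hexadecimal digits.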


To extract the imaginary part from x, one can, of
course, extract the real part and subtract that from x. To
do it directly by the "shortcut" method, the table below
shows the translation of each hexadecimal digit to its pure
imaginary part.


   0 → 0     4 → 4     8 → 74    C → 0
   1 → 0     5 → 4     9 → 74    D → 0
   2 → 3     6 → 7     A → 77    E → 3
   3 → 3     7 → 7     B → 77    F → 3


Thus, a carry of 7 can occur, so we need addition rules
to add 7 to the four possible translated digits of 0, 3, 4,
and 7. These are shown in the left-hand column below.


   0 + 7 = 7      0 + 3 = 3
   3 + 7 = 0      3 + 3 = 74
   4 + 7 = 33     4 + 3 = 7
   7 + 7 = 4      7 + 3 = 0


Now a carry of 3 can occur, and the right-hand column
above shows how to deal with that.


The example below illustrates the use of this method
to extract the imaginary part of the base -1 + i number
568A (written in hexadecimal).

   568A   x
    77    carries from the translation
   4747   x with its hex digits translated
   ----
   4737   sum

The reader may verify that x is -87 + 107i and the sum
is 107i.

A base -1 + i number is imaginary iff all of its digits,
expressed in hexadecimal, are 0, 3, 4, or 7.


To convert a number to its complex conjugate,
subtract twice a number's imaginary part. A table can be
used, as above, but the conversion is more complicated
because more carries can be generated, and the translated
number can contain any of the 16 hexadecimal digits. The
translation table is shown below.


   0 → 0     4 → 74    8 → 38    C → C
   1 → 1     5 → 75    9 → 39    D → D
   2 → 6     6 → 2     A → 3E    E → 3A
   3 → 7     7 → 3     B → 3F    F → 3B


The carries can be added in using base -1 + i
arithmetic or by devising a table that does the addition a
hexadecimal digit at a time. The table is larger than those
above, because the carries can be added to any of the 16
possible hexadecimal digits.

Chapter 13: Gray Code 


1. Proof sketch 1: It is apparent from the construction of the
reflected binary Gray code.

Proof sketch 2: From the formula
$G(x) = x \oplus (x \overset{u}{\gg} 1)$, it can be seen that G(x) is 1 at
position i wherever there is a transition from 0 to 1 or
from 1 to 0 from position i to the bit to the left of i, and is
0 otherwise. If x is even, there are an even number of
transitions, and if x is odd, there are an odd number of
transitions.


Proof sketch 3: By induction on the length of x, using
the formula given above: The statement is true for the
one-bit words 0 and 1. Let x be a binary word of length n,
and assume inductively that the statement is true for x. If
x is prepended with a 0-bit, G(x) is also prepended with a
0-bit, and the remaining bits are G(x). If x is prepended
with a 1-bit, then G(x) is also prepended with a 1-bit, and
its next most significant bit is complemented. The
remaining bits are unchanged. Therefore, the number of
1-bits in G(x) is either increased by 2 or is unchanged.


Thus, one can construct a random number generator
that generates integers with an even (or odd) number of
1-bits by using a generator of uniformly distributed
integers, setting the least significant bit to 0 (or to 1), and
converting the result to Gray code [Arndt].
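
The parity property and the generator built on it can be sketched in C as follows. The names are chosen for this illustration, and the argument r stands for whatever uniformly distributed integer the caller supplies; no particular random number generator is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Reflected binary Gray code of x. */
uint32_t gray(uint32_t x) {
    return x ^ (x >> 1);
}

/* Number of 1-bits in x, by a simple loop. */
int popcount32(uint32_t x) {
    int n = 0;
    while (x != 0) {
        x &= x - 1;             /* clear the lowest 1-bit */
        n++;
    }
    return n;
}

/* Map an arbitrary integer r to one with an even number of 1-bits. */
uint32_t with_even_ones(uint32_t r) {
    return gray(r & ~1u);
}

/* Map an arbitrary integer r to one with an odd number of 1-bits. */
uint32_t with_odd_ones(uint32_t r) {
    return gray(r | 1u);
}

/* G(x) has an odd number of 1-bits exactly when x is odd. */
void gray_parity_selfcheck(void) {
    for (uint32_t x = 0; x < 65536; x++)
        assert((popcount32(gray(x)) & 1) == (int)(x & 1));
}
```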

2. (a) Because each column is a cyclic shift of column 1, the 
result follows immediately. 
(b) No such code exists. This is not difficult to verify by
enumerating all possible Gray codes for n = 3. Without
loss of generality, one can start with


000
001
011


because any Gray code can be made to start that way by
complementing columns and rearranging columns.


Corollary: There is no STGC for n = 3 that has eight code
words.


3. The code below was devised by reflecting the first five 
code words of the reflected binary Gray code. 




0000 
0001 
0011 
0010 
0110 
1110 
1010 
1011 
1001 
1000 


Another code can be derived by taking the “excess 3" 
binary coded decimal (BCD) code and converting it to 
Gray. The result turns out to be cyclic. The excess 3 code 
for encoding decimal digits has the property that addition 
of coded words generates a carry precisely when addition 
of the decimal digits would. 


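EXCESS THREE GRAY CODE


Decimal   Excess 3   Gray Code
Digit     Code       Equivalent

0         0011       0010
1         0100       0110
2         0101       0111
3         0110       0101
4         0111       0100
5         1000       1100
6         1001       1101
7         1010       1111
8         1011       1110
9         1100       1010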


4. It is a simple matter to derive a “mixed base” Gray code,
using the principle of reflection. For a number with prime
decomposition $2^{e_1} 3^{e_2} 5^{e_3} \cdots$, the columns of the Gray code
should be in base $e_1 + 1$, $e_2 + 1$, $e_3 + 1$, .... For example,
for the number $72 = 2^3 \cdot 3^2$, the list below shows a “base
4 - base 3” Gray code and the divisor of 72 that each code
word represents.


00    1
01    3
02    9
12   18
11    6
10    2
20    4
21   12
22   36
32   72
31   24
30    8


Clearly each divisor follows from the previous one by one 
multiplication or division by a prime number. 


Even simpler: A binary Gray code can be used to 
iterate over the subsets of a set in such a way that in each 
step only one member is added or removed. 


Chapter 14: Cyclic Redundancy Check 


1. From the text, a message polynomial M and generator
polynomial G satisfy $Mx^r = QG + R$, where R is the
checksum polynomial. Let M′ be a message polynomial
that differs from M at term $x^e$. (That is, the binary
message differs at bit position e.) Then $M' = M + x^e$,
and


   $M'x^r = (M + x^e)x^r = Mx^r + x^{e+r} = QG + R + x^{e+r}.$


The term $x^{e+r}$ is not divisible by G, because G has two
or more terms. (The only divisors of $x^{e+r}$ are of the
form $x^n$.) Therefore, the remainder upon dividing $M'x^r$
by G is distinct from R, so the error is detected.


2. The main loop might be coded as shown below, where 


word is an unsigned int [Danne]. 


crc = 0xFFFFFFFF;
while (((word = *(unsigned int *)message) & 0xFF) != 0) {
   crc = crc ^ word;
   crc = (crc >> 8) ^ table[crc & 0xFF];
   crc = (crc >> 8) ^ table[crc & 0xFF];
   crc = (crc >> 8) ^ table[crc & 0xFF];
   crc = (crc >> 8) ^ table[crc & 0xFF];
   message = message + 4;
}


Compared to the code of Figure 14-7 on page 329, 
this saves three load byte and three exclusive or 
instructions for each word of message. And, there are 
fewer loop control instructions executed. 


Chapter 15: Error-Correcting Codes 


1. Your table should look like Table 15-1 on page 333, with 


the rightmost column and the odd numbered rows 
deleted. 


2. In the first case, if an error occurs in a check bit, the


receiver cannot know that, and it will make an erroneous 
“correction” to the information bits. 


In the second case, if an error occurs in a check bit, 
the syndrome will be one of 100...0, 010...0, 001...0, ..., 
000...1 (k distinct values). Therefore k must be large 
enough to encode these k values, as well as the m values 
to encode a single error in one of the m information bits, 
and a value for “no errors.” So the Hamming rule stands. 


One thing along these lines that could be done is to 
have a single parity bit for the k check bits, and have the 
k check bits encode values that designate one error in an 
information bit (and where it is), or no errors occurred. 
For this code, k could be chosen as the smallest value for
which $2^k \ge m + 1$. The code length would be m + k +
1, where the “+1” is for the parity bit on the check bits.
But this code length is nowhere better than that given by 
the Hamming rule, and is sometimes worse. 


3. Treating k and m as real numbers, the following iteration
converges from below quite rapidly:


   $k_0 = 0,$
   $k_{i+1} = \lg(m + 1 + k_i), \quad i = 0, 1, \ldots,$


where lg(x) is the log base 2 of x. The correct result is
given by ceil($k_2$); that is, only two iterations are required for
all $m \ge 0$.


Taking another tack, it is not difficult to prove that for
$m \ge 0$,


   $\mathrm{bitsize}(m) \le k \le \mathrm{bitsize}(m) + 1.$


Here bitsize(m) is the size of m in bits, for example,
bitsize(3) = 2, bitsize(4) = 3, and so forth. (This is
different from the function of the same name described in
Section 5-3 on page 99, which is for signed integers.)
Hint: $\mathrm{bitsize}(m) = \lceil \lg(m + 1) \rceil = \lfloor \lg(m) \rfloor + 1$, where we
take lg(0) to be -1. Thus, one can try k = bitsize(m), test
it, and if it proves to be too small then simply add 1 to
the trial value. Using the number of leading zeros function
to compute bitsize(m), one way to commit this to code is:


   $k \leftarrow W - \mathrm{nlz}(m),$
   $k \leftarrow k + (((1 \ll k) - 1 - k) \overset{u}{<} m),$

where W is the machine's word size and $0 \le m \le 2^W - 1$.
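
The two-step computation of k can be tried in C and compared with a direct search for the smallest k satisfying the Hamming rule $2^k \ge m + k + 1$. This is a sketch under stated assumptions: W = 32, the function names are invented here, nlz is a plain loop rather than a machine instruction, and the shift is done in 64 bits so that k = 32 causes no trouble.

```c
#include <assert.h>
#include <stdint.h>

/* Number of leading zeros of a 32-bit word (simple loop version). */
static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Check bits for m information bits: try k = bitsize(m), add 1 if too small. */
int hamming_check_bits(uint32_t m) {
    int k = 32 - nlz(m);                              /* bitsize(m) */
    uint64_t room = (1ull << k) - 1 - (uint64_t)k;    /* largest m that k covers */
    return k + (room < m);
}

/* Reference: smallest k with 2^k >= m + k + 1, by direct search. */
int hamming_check_bits_ref(uint32_t m) {
    int k = 0;
    while ((1ull << k) < (uint64_t)m + (uint64_t)k + 1)
        k++;
    return k;
}

/* Compare the two on small m and on a few large values. */
void hamming_selfcheck(void) {
    for (uint32_t m = 0; m < 100000; m++)
        assert(hamming_check_bits(m) == hamming_check_bits_ref(m));
    assert(hamming_check_bits(0x7FFFFFFFu) == hamming_check_bits_ref(0x7FFFFFFFu));
    assert(hamming_check_bits(0xFFFFFFFFu) == hamming_check_bits_ref(0xFFFFFFFFu));
}
```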


4. Answer: If $d(x, z) > d(x, y) + d(y, z)$, it must be that for at
least one bit position i, that bit position contributes 1 to
d(x, z) and 0 to d(x, y) + d(y, z). This implies that $x_i \ne z_i$,
but $x_i = y_i$ and $y_i = z_i$, clearly a contradiction.


5. Given a code of length n and minimum distance d, simply 
double-up each 1 and each 0 in each code word. The 
resulting code is of length 2n, minimum distance 2d, and 
is the same size. 


6. Given a code of length n, minimum distance d, and size
A(n, d), think of it as being displayed as in Table 15-1 on
page 333. Remove an arbitrary d - 1 columns. The
resulting code words, of length n - (d - 1), have a minimum
distance of at least 1. That is, they are all distinct. Hence
their number cannot be more than $2^{n-(d-1)}$. Since
deleting columns did not change the code size, the
original code's size is at most $2^{n-(d-1)}$, so that
$A(n, d) \le 2^{n-d+1}$.


7. The Hamming rule applies to the case that d = 3 and the
code has $2^m$ code words, where m is the number of
information bits. The right-hand part of inequality (6),
with $A(n, d) = 2^m$ and d = 3, is


   $2^m \le \frac{2^n}{\sum_{i=0}^{1} \binom{n}{i}} = \frac{2^n}{1 + n}.$


Replacing n with m + k gives


   $2^m \le \frac{2^{m+k}}{1 + m + k},$

which on cancelling $2^m$ on each side becomes inequality
(1).


8. The code must consist of an arbitrary bit string and its
one's-complement, so its size is 2. That these codes are
perfect, for odd n, can be seen by showing that they
achieve the upper bound in inequality (6). Proof sketch:
An n-bit binary integer may be thought of as
representing uniquely a choice from n objects, with a 1-bit
meaning to choose and a 0-bit meaning not to choose
the corresponding object. Therefore, there are $2^n$ ways to
choose from 0 to n objects from n objects; that is,
$\sum_{i=0}^{n} \binom{n}{i} = 2^n$. If n is odd, i ranging from 0 to (n - 1)/2
covers half the terms of this sum, and because of the
symmetry $\binom{n}{i} = \binom{n}{n-i}$, it accounts for half the sum.
Therefore $\sum_{i=0}^{(n-1)/2} \binom{n}{i} = 2^{n-1}$, so that the upper bound
in (6) is 2. Thus, the code achieves the upper bound of
(6).


9. For ease of exposition, this proof will make use of the 
notion of equivalence of codes. Clearly a code is not 
changed in any substantial way by rearranging its 


columns (as depicted in Table 15-1 on page 333) or by 
complementing any column. If one code can be derived 
from another by such transformations, they are said to be 
equivalent. Because a code is an unordered set of code 
words, the order of a display of its code words is 
immaterial. By complementing columns, any code can be 
transformed into an equivalent code that has a code word 
that is all 0’s. 


Also for ease of exposition, we illustrate this proof by 
using the case n = 9 and d = 6. 


Wlog (without loss of generality), let code word 0 (the
first, which we will call cw0) be 000 000 000. Then all
other code words must have at least six 1’s, to differ from
cw0 in at least six places.


Assume (which will be shown) that the code has at
least three code words. Then no code word can have
seven or more 1’s. For if one did, then another code word
(which necessarily has six or more 1’s) would have at
least four of its 1’s in the same columns as the word with
seven or more 1’s. This means the code words would be
equal in four or more positions, so they could differ in
five or fewer positions (9 - 4), violating the requirement
that d = 6. Therefore, all code words other than the first
must have exactly six 1’s.


Wlog, rearrange the columns so that the first two code
words are


cw0: 000 000 000
cw1: 111 111 000


The next code word, cw2, cannot have four or more of its
1’s in the left six columns, because then it would be the
same as cw1 in four or more positions, so it would differ
from cw1 in five or fewer positions. Therefore it has three
or fewer of its 1’s in the left six columns, so that three of
its 1’s must be in the right three positions. Therefore
exactly three of its 1’s are in the left six columns.
Rearrange the left six columns (of all three code words)
so that cw2 looks like this:


cw2: 111 000 111


By similar reasoning, the next code word (cw3) cannot
have four of its 1’s in the left three and right three
positions together, because it would then equal cw2 in
four positions. Therefore it has three or fewer 1’s in the left
three and right three positions, so that three of its 1’s
must be in the middle three positions. By similarly
comparing it to cw1, we conclude that three of its 1’s
must be in the right three positions. Therefore cw3 is:


cw3: 000 111 111


By comparing the next code word, if one is possible,
with cw1, we conclude that it must have three 1’s in the
right three positions. By comparing it with cw2, we
conclude it must have three 1’s in the middle three
positions.


Thus, the code word is 000 111 111, which is the same as
cw3. Therefore a fifth code word is impossible. By
inspection, the above four code words satisfy d = 6, so
A(9, 6) = 4.


10. Obviously A(n, d) is at least 2, because the two code words
can be all 0’s and all 1’s. Reasoning as in the previous
exercise, let one code word, cw0, be all 0’s. Then all other
code words must have more than 2n/3 1’s. If the code has
three or more code words, then any two code words other
than cw0 must have 1’s in the same positions for more
than 2n/3 - n/3 = n/3 positions, as suggested by the
figure below.


   1111...1111   0...0
     > 2n/3      < n/3

(The figure represents cw1 with its 1’s pushed to the left.
Imagine placing the more than 2n/3 1’s of cw2 to
minimize the overlap of the 1’s.) Since cw1 and cw2
overlap in more than n/3 positions, they can differ in less
than n - n/3 = 2n/3 positions, resulting in a minimum
distance less than 2n/3.


11. It is SEC-DED, because the minimum distance between
code words is 4. To see this, assume first that two code 
words differ in a single information bit. Then in addition 
to the information bit, the row parity, column parity, and 


corner check bits will be different in the two code words, 
making their distance equal to 4. If the information words 
differ in two bits, and they are in the same row, then the 
row parity bit will be the same in the two code words, but 
the column parity bit will differ in two columns. Hence 
their distance is 4. The same result follows if they are in 
the same column. If the two differing information bits are 
in different rows and columns, then the distance between 
the code words is 6. Lastly, if the information words differ 
in three bits, it is easy to verify that no matter what their 
distribution among the rows and columns, at least one 
parity bit will differ. Hence the distance is at least 4. 


If the corner bit is not used, the minimum distance is 
3. Therefore it is not SEC-DED, but it is a SEC code. 


Whether the corner check bit is a row sum or a 
column sum, it is the modulo 2 sum of all 64 information 
bits, so it has the same value in either case. 


The code requires 17 check bits, whereas the 
Hamming code requires eight (see Table 15-3 on page 
336), so it is not very efficient in that respect. 


But it is effective in detecting burst errors. Assume the 
9 × 9 array is transmitted over a bit serial channel in the 
order row 0, row 1,..., row 8. Then any sequence of ten or 
fewer bits is in one or two rows with at most one bit of 
overlap. Hence if the only errors in a transmission are a 
subset of ten consecutive bits, the error will be detected 
by checking the column parities in most cases, or the row 
parity bits in the case that the first and tenth bits only are 
in error. 


An error that is not detected is four corrupted bits 
arranged in a rectangle. 
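A small sketch makes the rectangle case concrete. It models only the 8 × 8 block of information bits (one byte per row) and recomputes the row, column, and corner check bits; the function names and the bit layout of the returned word are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Check bits for an 8x8 block of information bits, one byte per row:
   bits 0-7 = row parities, bits 8-15 = column parities, bit 16 = corner. */
uint32_t parity_checks(const uint8_t rows[8]) {
    uint32_t row_par = 0, col_par = 0;
    for (int i = 0; i < 8; i++) {
        unsigned r = rows[i];
        col_par ^= r;                      /* XOR of rows gives column parities. */
        r ^= r >> 4; r ^= r >> 2; r ^= r >> 1;
        row_par |= (uint32_t)(r & 1u) << i;
    }
    uint32_t c = col_par;                  /* Corner: parity of all 64 bits. */
    c ^= c >> 4; c ^= c >> 2; c ^= c >> 1;
    return row_par | (col_par << 8) | ((c & 1u) << 16);
}

/* Flips the four corners of the rectangle with rows r0, r1 and columns
   c0, c1, and returns 1 if every check bit is unchanged afterward. */
int rectangle_goes_undetected(uint8_t rows[8], int r0, int r1, int c0, int c1) {
    assert(r0 != r1 && c0 != c1);
    uint32_t before = parity_checks(rows);
    uint8_t mask = (uint8_t)((1u << c0) | (1u << c1));
    rows[r0] ^= mask;
    rows[r1] ^= mask;
    return parity_checks(rows) == before;
}
```

Each affected row and column receives exactly two flips, so every parity is preserved and the recomputed check bits match the transmitted ones.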


Chapter 16: Hilbert's Curve 
1. and 2. 


(x, y) = unshuf(s)          (x, y) = unshuf(Gray(s)) 


The average jump distance for the traversal shown at 
the left above is approximately 1.46. That for the 
traversal shown at the right is approximately 1.33. 
Therefore, using the Gray code seems to improve locality, 
at least by this measure. (For the Hilbert curve, the jumps 
are all of distance 1.) 


At Edsger Dijkstra's suggestion, the shuffle algorithm 
was used in an early Algol compiler to map a matrix onto 
backing store. The aim was to reduce paging operations 
when inverting a matrix. He called it the "zip-fastener 
algorithm.” It seems likely that many people have 
discovered it independently. 


3. Use every third bit of s. 
Chapter 17: Floating-Point 


1. ±0, ±2.0, and certain NaNs. 


2. Yes! The program is easily derived by noting that if x = 
2^n(1 + f), then 

    \sqrt{x} = 2^{n/2} (1 + f)^{1/2}. 

Ignoring the fraction, this shows that we must change 
the biased exponent from 127 + n to 127 + n/2. The 
latter is (127 + n)/2 + 127/2. Thus, it seems that a rough 
approximation to √x is obtained by shifting rep(x) right 
one position and adding 63 in the exponent position, 
which is 0x1F800000. This approximation, 

    \mathrm{rep}(\sqrt{x}) \approx k + (\mathrm{rep}(x) \gg 1), 

also has the property that if we find an optimal value of k 
for values of x in the range 1.0 to 4.0, then the same 
value of k is optimal for all normal numbers. After 
refining the value of k with the aid of a program that 
finds the maximum and minimum error for a given value 
of k, we obtain the program shown below. It includes one 
step of Newton-Raphson iteration. 




float asqrt(float x0) { 
   union {int ix; float x;}; 

   x = x0;                        // x can be viewed as int. 
   ix = 0x1fbb67a8 + (ix >> 1);   // Initial guess. 
   x = 0.5f*(x + x0/x);           // Newton step. 
   return x; 
} 


For normal numbers, the relative error ranges from 0 
to approximately 0.000601. It gets the correct result for x 
= inf and x = NaN (inf and NaN, respectively). For x = 
0 the result is approximately 4.0 × 10^-20. For x = -0, the 
result is the rather useless -1.35 × 10^19. For x a positive 
denorm, the result is either within the stated tolerance or 
is a positive number less than 10^-19. 


The Newton step uses division, so on most machines 
the program is not as fast as that for the reciprocal square 
root. 


If a second Newton step is added, the relative error for 
normal numbers ranges from 0 to approximately 
0.00000023. The optimal constant is 0х1ЕВВЗЕВО. If no 
Newton step is included, the relative error is slightly less 
than ±0.035, using a constant of 0x1FBB4F2E. This is 
about the same as the relative error of the reciprocal 
square root routine without a Newton step, and like it, 
uses only two integer operations. 
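The error bound quoted for the one-step version can be spot-checked without a math library by squaring the result. The sketch below is a variant of asqrt, not the listing itself: it swaps the union for memcpy so that the bit reinterpretation is well defined in standard C, and the names asqrt_portable and max_square_error are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Approximate square root with the same constant and Newton step as asqrt,
   using memcpy to view the float's bits as an integer. */
float asqrt_portable(float x0) {
    int32_t ix;
    float x;
    memcpy(&ix, &x0, sizeof ix);
    ix = 0x1fbb67a8 + (ix >> 1);        /* Initial guess. */
    memcpy(&x, &ix, sizeof x);
    return 0.5f * (x + x0 / x);         /* Newton step. */
}

/* Largest |y*y/x - 1| over n+1 evenly spaced samples in [lo, hi], where
   y = asqrt_portable(x). This is roughly twice the relative error of y. */
double max_square_error(float lo, float hi, int n) {
    assert(n > 0 && lo > 0.0f && lo < hi);
    double worst = 0.0;
    for (int i = 0; i <= n; i++) {
        float x = lo + (hi - lo) * (float)i / (float)n;
        double y = asqrt_portable(x);
        double e = y * y / x - 1.0;
        if (e < 0.0) e = -e;
        if (e > worst) worst = e;
    }
    return worst;
}
```

Over the range 1.0 to 4.0 the value returned should stay near 2 × 0.000601, in line with the tolerance stated above.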


3. Yes, one can do cube roots of positive normal numbers 
with basically the same method. The key statement is the 
first approximation: 




i = 0x2a51067f + i/3;            // Initial guess. 


This computes the cube root with a relative error of 
approximately 0.0316. The division by 3 can be 
approximated with 

    i/3 \approx i/4 + i/16 + i/64 + i/256 + \cdots + i/65536 

(where the divisions by powers of 2 are implemented as 
right shifts). This can be evaluated with seven 
instructions and slightly improved accuracy as shown in 
the program below. (This division trick is discussed in 
Section 10-18 on page 251.) 




float acbrt(float x0) { 
   union {int ix; float x;}; 

   x = x0;                      // x can be viewed as int. 
   ix = ix/4 + ix/16;           // Approximate divide by 3. 
   ix = ix + ix/16; 
   ix = ix + ix/256; 
   ix = 0x2a5137a0 + ix;        // Initial guess. 
   x = 0.33333333f*(2.0f*x + x0/(x*x));  // Newton step. 
   return x; 
} 


Although we avoided the division by 3 (at a cost of 
seven elementary integer instructions), there is a division 
and four other instructions in the Newton step. The 
relative error ranges from 0 to approximately +0.00103. 
Thus, the method is not as successful as in the case of 
reciprocal square root and square root, but it might be 
useful in some situations. 


If the Newton step is repeated and the same constant 
is used, the relative error ranges from 0 to approximately 
+ 0.00000116. 
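The shift-and-add replacement for the division by 3 in acbrt can be examined on its own. The sketch below isolates those three statements in a hypothetical helper, approx_div3, using right shifts for the divisions by powers of 2 and assuming a nonnegative argument.

```c
#include <assert.h>
#include <stdint.h>

/* Approximates i/3 as i*(1/4 + 1/16 + ... + 1/65536) using shifts and adds. */
int32_t approx_div3(int32_t i) {
    assert(i >= 0);
    i = (i >> 2) + (i >> 4);   /* i*(1/4 + 1/16)           */
    i = i + (i >> 4);          /* adds the 1/64, 1/256 terms */
    i = i + (i >> 8);          /* extends through 1/65536    */
    return i;
}
```

The result never exceeds the true quotient, and it falls short by a relative amount of about 4^-8 plus a few units of truncation, which is why the constant in acbrt differs slightly from the one used with the exact i/3.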


4. Yes. The program below computes the reciprocal square 
root of a double-precision floating-point number with an 
accuracy of about ±3.5%. It is straightforward to 
improve its accuracy with one or two steps of Newton- 
Raphson iteration. Using the constant 0x5fe80...0 gives a 
relative error in the range 0 to approximately +0.887, 
and the constant 0x5fe618fdf80...0 gives a relative error 
in the range 0 to approximately -0.0613. 




double rsqrtd(double x0) { 
   union {long long ix; double x;}; 

   x = x0; 
   ix = 0x5fe6ec85e8000000LL - (ix >> 1); 
   return x; 
} 
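As a sketch of the refinement mentioned in the answer, the variant below adds one Newton-Raphson step of the usual form for the reciprocal square root, x(3 - ax²)/2. The name rsqrtd_refined, the use of memcpy in place of the union, and the restriction to positive arguments are assumptions of this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reciprocal square root of a positive double: the integer initial guess
   from rsqrtd followed by one Newton step. */
double rsqrtd_refined(double x0) {
    assert(x0 > 0.0);
    int64_t ix;
    double x;
    memcpy(&ix, &x0, sizeof ix);
    ix = 0x5fe6ec85e8000000LL - (ix >> 1);    /* Initial guess. */
    memcpy(&x, &ix, sizeof x);
    x = x * (1.5 - 0.5 * x0 * x * x);         /* Newton step. */
    return x;
}
```

Because the step roughly squares the initial error of about 3.5%, the refined result should be good to a few parts in a thousand; a second step would square the error again.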


Chapter 18: Formulas for Primes 


1. Let f(x) = a_n x^n + a_{n-1} x^{n-1} + ... + a_0. Such a polynomial 
monotonically approaches infinity, in magnitude, as x 
approaches infinity. (For sufficiently large x, the first 
term exceeds in magnitude the sum of the others.) 


Let x_0 be an integer such that |f(x)| ≥ 2 for all x ≥ x_0. 
Let f(x_0) = k, and let r be any positive integer. Then 
|k| ≥ 2, and 

    |f(x_0 + rk)| = |a_n (x_0 + rk)^n + a_{n-1} (x_0 + rk)^{n-1} + \cdots + a_0| 
                  = |f(x_0) + \text{a multiple of } rk| 
                  = |k + \text{a multiple of } rk|. 

Thus, as r increases, |f(x_0 + rk)| ranges over composites 
that increase in magnitude, and hence are distinct. 
Therefore f(x) takes on an infinite number of composite 
values. 

Another way to state the theorem is that there is no 
non-constant polynomial in one variable that takes on 
only prime numbers, even for sufficiently large values of 
its argument. 

Example: Let f(x) = x^2 + x + 41. Then f(1) = 43 
and 

    f(1 + 43r) = (1 + 43r)^2 + (1 + 43r) + 41 
               = (1 + 86r + 43^2 r^2) + (1 + 43r) + 41 
               = 1 + 1 + 41 + 86r + 43^2 r^2 + 43r 
               = 43 + (2 + 43r + 1) \cdot 43r, 

which clearly produces ever-increasing multiples of 43 as 
r increases. 
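The example is easy to confirm numerically. In the sketch below, f is the polynomial from the example, and all_multiples_of_43 is a hypothetical helper that checks a range of r.

```c
#include <assert.h>

/* The polynomial from the example. */
long long f(long long x) { return x * x + x + 41; }

/* Returns 1 if f(1 + 43r) is a multiple of 43 larger than 43 itself
   for every r from 1 through rmax. */
int all_multiples_of_43(int rmax) {
    assert(rmax >= 1 && rmax <= 1000000);
    for (long long r = 1; r <= rmax; r++) {
        long long v = f(1 + 43 * r);
        if (v % 43 != 0 || v <= 43) return 0;
    }
    return 1;
}
```

For instance, r = 1 gives f(44) = 2021 = 43 · 47.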
2. Suppose p is composite. Write the congruence as 


(p - 1)! = pk - 1, 


for some integer k. Let a be a proper factor of p. Then a 
divides the left side, but not the right side, so equality 
cannot hold. 


The theorem is easily seen to be true for p = 1, 2, and 
3. Suppose p is a prime greater than 3. Then in the 
factorial 


(p - 1)! = (p - 1)(p - 2)...(3)(2), 


the first term, p - 1, is congruent to -1 modulo p. Each of 
the other terms is relatively prime to p and therefore has 
a multiplicative inverse modulo p (see Section 10-16 on 
page 240), and furthermore, the inverse is unique and not 
equal to itself. 


To see that the multiplicative inverse modulo a prime 
is not equal to itself (except for 1 and p - 1), suppose a^2 
≡ 1 (mod p). Then a^2 - 1 ≡ 0 (mod p), so that (a - 1)(a 
+ 1) ≡ 0 (mod p). Because p is a prime, either a - 1 or a 
+ 1 is congruent to 0 modulo p. In the former case a ≡ 1 
(mod p) and in the latter case a ≡ -1 ≡ p - 1 (mod p). 


Therefore, the integers p - 2, p - 3, ..., 2 can be paired 
so that the product of each pair is congruent to 1 modulo 
p. That is, 


(p - 1)! ≡ (p - 1)(ab)(cd)... (mod p), 


where a and b are multiplicative inverses, as are c and d, 
and so forth. Thus 


(p - 1)! ≡ (-1)(1)(1)... ≡ -1 (mod p). 


Example, p = 11: 10! (mod 11) = 10·9·8·7·6·5· 
4·3·2 (mod 11) = 10·(9·5)(8·7)(6·2)(4·3) (mod 11) = 
(-1)(1)(1)(1)(1) (mod 11) = -1 (mod 11). 
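Wilson's theorem can be tried out directly for small moduli. The sketch below is an illustration only (the function names are hypothetical), and as a primality test it is far too slow to be practical.

```c
#include <assert.h>

/* (n - 1)! mod n, reducing after every multiplication to avoid overflow. */
unsigned factorial_mod(unsigned n) {
    assert(n >= 2);
    unsigned long long acc = 1;
    for (unsigned k = 2; k < n; k++)
        acc = acc * k % n;
    return (unsigned)acc;
}

/* Wilson's criterion: n > 1 is prime exactly when (n - 1)! = -1 (mod n). */
int wilson_is_prime(unsigned n) {
    return factorial_mod(n) == n - 1;
}
```

For p = 11 the residue is 10, which is -1 modulo 11, matching the worked example.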


The theorem is named for John Wilson, a student of 
the English mathematician Edward Waring. Waring 
announced it without proof in 1770. The first published 
proof was by Lagrange in 1773. The theorem was known 
in medieval Europe around 1000 AD. 


3. If n = ab, with a and b distinct and neither equal to 1 or 
n, then clearly a and b are less than n and hence are terms 
of (n – 1)!. Therefore n divides (n –1)!. 


If n = a^2, then for a > 2, a^2 = n > 2a, so that both a 
and 2a are terms of (n - 1)!. Therefore a^2 divides (n - 1)!. 


4. This is probably a case in which a calculation gives more 
insight into a mathematical truth than does a formal 
proof. 


According to Mills’s theorem, there exists a real 
number θ such that ⌊θ^(3^n)⌋ is prime for all integers n ≥ 1. 
Let us try the possibility that for n = 1, the prime is 2. 
Then 

    \lfloor \theta^{3} \rfloor = 2, 

so that 

    2 \le \theta^{3} < 3, \text{ or}                      (1) 
    2^{1/3} \le \theta < 3^{1/3}, \text{ or} 
    1.2599\ldots \le \theta < 1.4422\ldots. 

Cubing inequality (1) gives 

    8 \le \theta^{3^2} < 27.                              (2) 

There is a prime in this range. (From our assumption, 
there is a prime between 2^3 and (2 + 1)^3.) Let us choose 
11 for the second prime. Then, we will have ⌊θ^(3^2)⌋ = 11 
if we further constrain (2) to 

    11 \le \theta^{3^2} < 12.                             (3) 

Continuing, we cube (3), giving 

    1331 \le \theta^{3^3} < 1728.                         (4) 

We are assured that there is a prime between 1331 and 
1728. Let us choose the smallest one, 1361. Further 
constraining (4), 

    1361 \le \theta^{3^3} < 1362. 

So far, we have shown that there exists a real number 
θ such that ⌊θ^(3^n)⌋ is prime for n = 1, 2, and 3 and, by 
taking 27th roots of 1361 and 1362, that θ is between 
1.30637 and 1.30642. 
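The interval can be sanity-checked by picking a value inside it and cubing repeatedly. The sketch below uses a hypothetical helper, mills_term, and is limited to n ≤ 3, where double precision is comfortably adequate.

```c
#include <assert.h>

/* floor(theta^(3^n)), computed by cubing n times. */
long long mills_term(double theta, int n) {
    assert(theta > 1.0 && n >= 1 && n <= 3);
    double t = theta;
    for (int i = 0; i < n; i++)
        t = t * t * t;
    return (long long)t;   /* Truncation equals floor for positive t. */
}
```

With theta = 1.3064 the three terms come out as 2, 11, and 1361, whereas 1.3063 and 1.3065 already miss 1361 on the third step.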

Obviously the process can be continued. It can be 
shown that a limiting value of θ exists, but that is not 
really necessary. If, in the limit, θ is an arbitrary number 
in some finite range, that still verifies Mills's theorem. 


The above calculation shows that Mills's theorem is a 
little contrived. As far as its being a formula for primes, 
you have to know the primes to determine θ. It is like the 
formula for primes involving the constant 


a = 0.203005000700011000013..., 


given on page 392. The theorem clearly has little to do 
with primes. A similar theorem holds for any increasing 
sequence provided it is sufficiently dense. 


The steps above calculate the smallest θ that 
satisfies Mills's theorem. It is sometimes called Mills's 
constant, and it has been calculated to over 6850 decimal 
places [CC]. 


5. Suppose that there exist integers a, b, c, and d such that 

    (a + b\sqrt{-5})(c + d\sqrt{-5}) = 2.                 (5) 

Equating real and imaginary parts, 

    ac - 5bd = 2, \text{ and}                             (6) 
    ad + bc = 0.                                          (7) 

Clearly c ≠ 0, because if c = 0 then from (6), -5bd = 
2, which has no solution in integers. 


Also b ≠ 0, because if b = 0, then from (7), either a 
or d is 0. a = 0 does not satisfy (5). Therefore d = 0. 
Then (5) becomes ac = 2, so one of the factors in (5) is a 
unit, which is not an acceptable decomposition. 

From (7), abd + b^2 c = 0. From (6), a^2 c - 5abd = 2a. 
Combining, a^2 c + 5b^2 c = 2a, or 

    a^2 + 5b^2 = 2a/c                                     (8) 

(recall that c ≠ 0). The left side of (8) is at least a^2 + 5, 
which exceeds 2a/c whatever the values of a and c are. 


To see that 3 is prime, the equation 

    a^2 + 5b^2 = 3a/c 

can be similarly derived, with b ≠ 0 and c ≠ 0. This also 
cannot be satisfied in integers. 


The number 6 has two distinct decompositions into 
primes: 

    6 = 2 \cdot 3 = (1 + \sqrt{-5})(1 - \sqrt{-5}). 

We have not shown that 1 ± √-5 are primes. This can be 
shown by arguments similar to those given above 
(although somewhat longer), but it is not really necessary 
to do so to demonstrate that prime factorization is not 
unique in this ring. This is because however each of these 
numbers might factor into primes, the total 
decomposition will not be 2·3. 


Appendix A. Arithmetic Tables for a 4-Bit Machine 


In the tables in Appendix A, underlining denotes signed 
overflow. For example, in Table A-1, 7 + 1 = 8, which is not 
representable as a signed integer on a 4-bit machine, so signed 
overflow occurred. 
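The overflow rule for addition can be stated in a few lines of C. This is a sketch under the assumption that 4-bit signed values are carried in ordinary ints in the range -8 to 7; the function name add4 is hypothetical.

```c
#include <assert.h>

/* Adds two 4-bit signed values (-8..7). Returns the wrapped 4-bit result
   and sets *overflow when the true sum is not representable. */
int add4(int a, int b, int *overflow) {
    assert(a >= -8 && a <= 7 && b >= -8 && b <= 7);
    int s = a + b;
    *overflow = (s < -8 || s > 7);
    return (int)(((unsigned)s + 8u) & 15u) - 8;   /* Wrap into -8..7. */
}
```

For 7 + 1 the function reports overflow and returns -8, the value of the bit pattern 8 when read as a signed 4-bit integer.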


TABLE A-1. ADDITION 
The table for subtraction (Table A-2) assumes that the carry 
bit for a - b is set as it would be for a + ¬b + 1, so that carry is 
equivalent to “not borrow.” 


TABLE A-2. SUBTRACTION (ROW - COLUMN) 
For multiplication (Tables A-3 and A-4), overflow means that 
the result cannot be expressed as a 4-bit quantity. For signed 
multiplication (Table A-3), this is equivalent to the first five bits 
of the 8-bit result not being all 1’s or all 0’s. 
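The five-bit test for signed multiplication overflow can be written out and compared with a direct range check. The sketch below assumes operands held in ints in the range -8 to 7; mul4_overflows is a hypothetical name.

```c
#include <assert.h>

/* Returns 1 if the signed product of two 4-bit values does not fit in
   four bits, judged from the high five bits of the 8-bit product. */
int mul4_overflows(int a, int b) {
    assert(a >= -8 && a <= 7 && b >= -8 && b <= 7);
    unsigned p = (unsigned)(a * b) & 0xFFu;   /* 8-bit two's-complement product. */
    unsigned top5 = p >> 3;
    return top5 != 0u && top5 != 0x1Fu;
}
```

Checking all 256 operand pairs against the condition "product outside -8..7" shows the two tests agree.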


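TABLE A-3. SIGNED MULTIPLICATION 


                                        -8  -7  -6  -5  -4  -3  -2  -1 
         0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F 
   0 |   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
   1 |   0   1   2   3   4   5   6   7  F8  F9  FA  FB  FC  FD  FE  FF 
   2 |   0   2   4   6   8   A   C   E  F0  F2  F4  F6  F8  FA  FC  FE 
   3 |   0   3   6   9   C   F  12  15  E8  EB  EE  F1  F4  F7  FA  FD 
   4 |   0   4   8   C  10  14  18  1C  E0  E4  E8  EC  F0  F4  F8  FC 
   5 |   0   5   A   F  14  19  1E  23  D8  DD  E2  E7  EC  F1  F6  FB 
   6 |   0   6   C  12  18  1E  24  2A  D0  D6  DC  E2  E8  EE  F4  FA 
   7 |   0   7   E  15  1C  23  2A  31  C8  CF  D6  DD  E4  EB  F2  F9 
-8 8 |   0  F8  F0  E8  E0  D8  D0  C8  40  38  30  28  20  18  10   8 
-7 9 |   0  F9  F2  EB  E4  DD  D6  CF  38  31  2A  23  1C  15   E   7 
-6 A |   0  FA  F4  EE  E8  E2  DC  D6  30  2A  24  1E  18  12   C   6 
-5 B |   0  FB  F6  F1  EC  E7  E2  DD  28  23  1E  19  14   F   A   5 
-4 C |   0  FC  F8  F4  F0  EC  E8  E4  20  1C  18  14  10   C   8   4 
-3 D |   0  FD  FA  F7  F4  F1  EE  EB  18  15  12   F   C   9   6   3 
-2 E |   0  FE  FC  FA  F8  F6  F4  F2  10   E   C   A   8   6   4   2 
-1 F |   0  FF  FE  FD  FC  FB  FA  F9   8   7   6   5   4   3   2   1 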


TABLE A-4. UNSIGNED MULTIPLICATION 
Tables A-5 and A-6 are for conventional truncating division. 
Table A-5 shows a result of 8 with overflow for the case of the 
maximum negative number divided by -1, but on most machines 
the result in this case is undefined, or the operation is 
suppressed. 


TABLE A-5. SIGNED SHORT DIVISION (ROW ÷ COLUMN) 


                                -8 -7 -6 -5 -4 -3 -2 -1 
        0  1  2  3  4  5  6  7   8  9  A  B  C  D  E  F 
   0 |  -  0  0  0  0  0  0  0   0  0  0  0  0  0  0  0 
   1 |  -  1  0  0  0  0  0  0   0  0  0  0  0  0  0  F 
   2 |  -  2  1  0  0  0  0  0   0  0  0  0  0  0  F  E 
   3 |  -  3  1  1  0  0  0  0   0  0  0  0  0  F  F  D 
   4 |  -  4  2  1  1  0  0  0   0  0  0  0  F  F  E  C 
   5 |  -  5  2  1  1  1  0  0   0  0  0  F  F  F  E  B 
   6 |  -  6  3  2  1  1  1  0   0  0  F  F  F  E  D  A 
   7 |  -  7  3  2  1  1  1  1   0  F  F  F  F  E  D  9 
-8 8 |  -  8  C  E  E  F  F  F   1  1  1  1  2  2  4  8 
-7 9 |  -  9  D  E  F  F  F  F   0  1  1  1  1  2  3  7 
-6 A |  -  A  D  E  F  F  F  0   0  0  1  1  1  2  3  6 
-5 B |  -  B  E  F  F  F  0  0   0  0  0  1  1  1  2  5 
-4 C |  -  C  E  F  F  0  0  0   0  0  0  0  1  1  2  4 
-3 D |  -  D  F  F  0  0  0  0   0  0  0  0  0  1  1  3 
-2 E |  -  E  F  0  0  0  0  0   0  0  0  0  0  0  1  2 
-1 F |  -  F  0  0  0  0  0  0   0  0  0  0  0  0  0  1 


TABLE A-6. UNSIGNED SHORT DIVISION (ROW ÷ COLUMN) 


Tables A-7 and A-8 give the remainder associated with 
conventional truncating division. Table A-7 shows a result of 0 
for the case of the maximum negative number divided by -1, but 
on most machines the result for this case is undefined, or the 
operation is suppressed. 


TABLE A-7. REMAINDER FOR SIGNED SHORT DIVISION (ROW ÷ COLUMN) 


TABLE A-8. REMAINDER FOR UNSIGNED SHORT DIVISION (ROW ÷ COLUMN) 


Appendix B. Newton's Method 


To review Newton's method very briefly, we are given a 
differentiable function f of a real variable x and we wish to solve 
the equation f(x) = 0 for x. Given a current estimate x_n of a root 
of f, Newton's method gives us a better estimate x_{n+1} under 
suitable conditions, according to the formula 

    x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}. 

Here, f'(x_n) is the derivative of f at x = x_n. The derivation of this 
formula can be read off the figure below (solve for x_{n+1}). 


[Figure: the tangent to f at x_n, which crosses the x-axis at x_{n+1}.] 


The method works very well for simple, well-behaved 
functions such as polynomials, provided the first estimate is 
quite close. Once an estimate is sufficiently close, the method 
converges quadratically. That is, if r is the exact value of the 
root, and x_n is a sufficiently close estimate, then 

    |x_{n+1} - r| \le (x_n - r)^2. 

Thus, the number of digits of accuracy doubles with each 
iteration (e.g., if |x_n - r| ≤ 0.001 then |x_{n+1} - r| ≤ 
0.000001). 
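Quadratic convergence is easy to observe for the square root, where f(x) = x² - a and the Newton formula reduces to averaging x and a/x. The sketch below is one illustration; newton_sqrt is a hypothetical name, and the starting value is supplied by the caller.

```c
#include <assert.h>

/* Applies `steps` Newton iterations for f(x) = x*x - a, starting from x. */
double newton_sqrt(double a, double x, int steps) {
    assert(a > 0.0 && x > 0.0 && steps >= 0);
    for (int i = 0; i < steps; i++)
        x = 0.5 * (x + a / x);
    return x;
}
```

Starting from 1.5 for a = 2, the error drops from about 0.086 to 0.0025, then to 2 × 10^-6, then to about 10^-12, each error being no larger than the square of the one before.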

If the first estimate is way off, then the iterations may 
converge very slowly, may diverge to infinity, may converge to a 
root other than the one closest to the first estimate, or may loop 
among certain values indefinitely. 


This discussion has been quite vague because of phrases like 
“suitable conditions,” “well-behaved,” and “sufficiently close." 
For a more precise discussion, consult almost any first-year 
calculus textbook. 


In spite of the caveats surrounding this method, it is 
occasionally useful in the domain of integers. To see whether or 
not the method applies to a particular function, you have to 
work it out, such as is done in Section 11-1, “Integer Square 
Root,” on page 279. 


Table B-1 gives a few iterative formulas derived from 
Newton’s method, for computing certain numbers. The first 
column shows the number it is desired to compute. The second 
column shows a function that has that number as a root. The 
third column shows the right-hand side of Newton’s formula 
corresponding to that function. 


TABLE B-1. NEWTON’S METHOD FOR COMPUTING CERTAIN NUMBERS 


Quantity to Be 
Computed          Function      Iterative Formula 
√a                x^2 - a       (x_n + a/x_n)/2 
log2 a            2^x - a       x_n + (1/ln 2)(a/2^(x_n) - 1) 


It is not always easy, incidentally, to find a good function to 
use. There are, of course, many functions that have the desired 
quantity as a root, and only a few of them lead to a useful 
iterative formula. Usually, the function to use is a sort of inverse 
of the desired computation. For example, to find √a use f(x) = 
x^2 - a; to find log2 a use f(x) = 2^x - a, and so on.1 

The iterative formula for log2 a converges (to log2 a) even if 
the multiplier 1/ln 2 is altered somewhat (for example, to 1, or 
to 2). However, it then converges more slowly. A value of 3/2 or 
23/16 might be useful in some applications (1/ln 2 ≈ 1.4427). 
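A short derivation may help explain why the altered multipliers still work. The function g and the multiplier symbol m below are notation introduced for this sketch only.

```latex
\begin{align*}
f(x) &= 2^x - a, \qquad f'(x) = 2^x \ln 2, \\
x_{n+1} &= x_n - \frac{2^{x_n} - a}{2^{x_n}\ln 2}
         = x_n + \frac{1}{\ln 2}\left(\frac{a}{2^{x_n}} - 1\right). \\
\intertext{Replacing $1/\ln 2$ by a general multiplier $m$ gives the iteration
$x_{n+1} = g(x_n)$ with}
g(x) &= x + m\left(a\,2^{-x} - 1\right), \qquad
g'(\log_2 a) = 1 - m \ln 2 .
\end{align*}
```

Near the root the error is multiplied by about 1 - m ln 2 at each step, so the iteration converges locally whenever 0 < m < 2/ln 2 ≈ 2.885. The factor is about 0.31 for m = 1, -0.39 for m = 2, -0.040 for m = 3/2, and 0.0036 for m = 23/16. Only m = 1/ln 2 makes the factor zero, which is what yields quadratic convergence.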


Appendix C. A Gallery of Graphs of Discrete Functions 


This appendix shows plots of a number of discrete functions. 
They were produced by Mathematica. For each function, two 
plots are shown: one for a word size of three bits and the other 
for a word size of five bits. This material was suggested by Guy 
Steele. 


C-1 Plots of Logical Operations on Integers 


This section includes 3D plots of and(x, y), or(x, y), and xor(x, y) 
as functions of integers x and y, in Figures C-1, C-2, and C-3, 
respectively. 


FIGURE C-2. Plots of the logical or function. 


In Figure C-3, almost half of the points are hidden behind the 
diagonal plane x = y. 


FIGURE C-3. Plots of the logical exclusive or function. 


For and(x, y) (Figure C-1), a certain self-similar, or fractal, 
pattern of triangles is apparent. If the figure is viewed straight 
on parallel to the y-axis and taken to the limit for large integers, 
the appearance would be as shown in Figure C-4. 


FIGURE C-4. Self-similar pattern made by and(x, y). 


This is much like the Sierpinski triangle [Sagan], except Figure 
C-4 uses right triangles whereas Sierpinski used equilateral 
triangles. In Figure C-3, a pattern along the slanted plane is 
evident that is precisely the Sierpinski triangle if carried to the 
limit. 


C-2 Plots of Addition, Subtraction, and Multiplication 


This section includes 3D plots of addition, subtraction, and three 
forms of multiplication of unsigned numbers, using “computer 
arithmetic,” in Figures C-5 through C-9. Note that for the plot 
of the addition operation, the origin is the far-left corner. 


FIGURE C-6. Plots of x – y (computer arithmetic). 


In Figure C-7, the vertical scales are compressed; the highest 
peaks in the left figure are of height 7·7 = 49. 


FIGURE C-7. Plots of the unsigned product of x and y. 


FIGURE C-8. Plots of the low-order half of the unsigned 
product of x and y. 


FIGURE C-9. Plots of the high-order half of the unsigned 
product of x and y. 
C-3 Plots of Funct 


This section includes 3D plots of the quotient, remainder, 
greatest common divisor, and least common multiple functions 
of nonnegative integers x and y, in Figures C-10, C-11, C-12, 
and C-13, respectively. Note that in Figure C-10, the origin is 
the rightmost corner. 


FIGURE C-11. Plots of the remainder function rem(x, y). 


FIGURE C-12. Plots of the greatest common divisor function 
GCD(x, y). 


In Figure C-13, the vertical scales are compressed; the 
highest peaks in the left figure are of height LCM(6, 7) = 42. 


FIGURE C-13. Plots of the least common multiple function 
LCM(x, y). 


C-4 Plots of the Compress, SAG, and Rotate Left Functions 


This section includes 3D plots of compress(x, m), SAG(x, m), 
and rotate left (x rotated left by r positions) as functions of 
integers x, m, and r, in Figures C-14, C-15, and C-16, respectively. 

For compress and SAG, m is a mask. For compress, bits of x 
selected by m are extracted and compressed to the right, with 0- 
fill on the left. For SAG, bits of x selected by m are compressed 
to the left, and the unselected bits are compressed to the right. 
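The two definitions can be turned into straightforward bit-at-a-time code. The sketch below follows the descriptions in this section rather than any optimized method, assumes a 32-bit word (the plots use three and five bits), and uses compress_left as a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Bits of x selected by m, packed to the right with 0-fill on the left. */
uint32_t compress(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    int k = 0;
    for (int i = 0; i < 32; i++)
        if ((m >> i) & 1u) {
            r |= ((x >> i) & 1u) << k;
            k++;
        }
    return r;
}

/* Bits of x selected by m, packed to the left with 0-fill on the right. */
uint32_t compress_left(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    int k = 31;
    for (int i = 31; i >= 0; i--)
        if ((m >> i) & 1u) {
            r |= ((x >> i) & 1u) << k;
            k--;
        }
    return r;
}

/* SAG: selected bits go left, unselected bits go right. */
uint32_t sag(uint32_t x, uint32_t m) {
    return compress_left(x, m) | compress(x, ~m);
}

/* Hand-worked examples: the mask 0x0F0F selects the hex digits B and D. */
void compress_sag_demo(void) {
    assert(compress(0x0000ABCDu, 0x00000F0Fu) == 0x000000BDu);
    assert(sag(0x0000ABCDu, 0x00000F0Fu) == 0xBD0000ACu);
}
```

In the demo, the selected digits B and D move to the top of the word, and the remaining digits A and C, along with the zero upper half, are packed at the bottom.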


FIGURE C-14. Plots of the generalized extract, or compress(x, 
m) function. 


FIGURE C-16. Plots of the rotate left function (x rotated left by r). 


C-5 2D Plots of Some Unary Functions 


Figures C-17 through C-21 show 2D plots of some unary 
functions on bit strings that are reinterpreted as functions on 
integers. Like the 3D plots, these were also produced by 
Mathematica. For most functions, two plots are shown: one for a 
word size of four bits and the other for a word size of seven bits. 


FIGURE C-19. Plots of the ruler function (number of trailing 
zeros). 


FIGURE C-20. Plots of the population count function 
(number of 1-bits). 


FIGURE C-21. Plots of the bit reversal function. 


"Gray code function" refers to a function that maps an integer 
that represents a displacement or rotation amount to the Gray 
encoding for that displacement or rotation amount. The inverse 
Gray code function maps a Gray encoding to a displacement or 
rotation amount. See Figure 13-1 on page 313. 


Figure C-22 shows what happens to a deck of 16 cards, 
numbered 0 to 15, after one, two, and three outer perfect 
shuffles (in which the first and last cards do not move). The x 
coordinate is the original position of a card, and the y coordinate 
is the final position of that card after one, two, or three shuffles. 
Figure C-23 is the same for one, two, and three perfect inner 
shuffles. Figures C-24 and C-25 are for the inverse operations. 
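The mappings plotted in these figures can be generated with a few lines of code. The sketch below assumes the usual convention that the deck is cut into two equal halves and interleaved, with the top half's card landing first for an outer shuffle and second for an inner shuffle; the function names are hypothetical.

```c
#include <assert.h>

/* Final position of the card at position i after one outer perfect
   shuffle of an n-card deck (n even). Cards 0 and n-1 stay put. */
int outer_shuffle_pos(int i, int n) {
    assert(n > 0 && n % 2 == 0 && i >= 0 && i < n);
    int h = n / 2;
    return i < h ? 2 * i : 2 * (i - h) + 1;
}

/* Final position after one inner perfect shuffle. */
int inner_shuffle_pos(int i, int n) {
    assert(n > 0 && n % 2 == 0 && i >= 0 && i < n);
    int h = n / 2;
    return i < h ? 2 * i + 1 : 2 * (i - h);
}

/* Position after `times` shuffles; inner != 0 selects the inner shuffle. */
int shuffle_pos_after(int i, int n, int times, int inner) {
    for (int t = 0; t < times; t++)
        i = inner ? inner_shuffle_pos(i, n) : outer_shuffle_pos(i, n);
    return i;
}
```

For 16 cards, four outer shuffles or eight inner shuffles return every card to its original position, so plotting more than three shuffles of the outer kind would begin to repeat.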
FIGURE C-22. Plots of the outer perfect shuffle function. 


FIGURE C-25. Plots of the inner perfect unshuffle function. 


Figures C-26 and C-27 show the mapping that results from 
shuffling the bits of an integer of four and eight bits in length. 
Informally, 


shuffleBits(x) = asInteger(shuffle(bits(x))) 


FIGURE C-27. Plots of the inner perfect shuffle bits function. 
dividing by. See Division of integers by constants. 
multiplying by, 175-178 
Counting bits. See also ntz (number of trailing zeros) function; nlz 
(number of leading zeros) function; population count 
function. 
1-bits in 
7- and 8-bit quantities, 87 
an array, 89-95 
a word, 81-88 
bitsize function, 106-107 
comparing two words, 88-89 
divide and conquer strategy, 81-82 
leading 0's, with 
binary search method, 99-100 
floating-point methods, 104-106 
population count instruction, 101-102 
rotate and sum method, 85-86 
search tree method, 109 
with table lookup, 86-87 
trailing 0's, 107-114 
by turning off 1-bits, 85 
CRC (cyclic redundancy check) 
background, 319-320 
check bits, generating, 319-320 
checksum, computing 
generator polynomials, 322-323, 329 
with hardware, 324-326 
with software, 327-329 
with table lookup, 328-329 
techniques for, 320 
code vector, 319 
definition, 319 
feedback shift register circuit, 325-326 
generator polynomial, choosing, 322-323, 329 
parity bits, 319-320 


practice 
hardware checksums, 324-326 
leading zeros, detecting, 324 
overview, 323-324 
residual/residue, 324 
software checksums, 327-329 
trailing zeros, detecting, 324 
theory, 320-323 
CRC codes, generator polynomials, 322, 323 
CRC-CCITT polynomial, 323 
Cryptography 
Advanced Encryption Standard, 164 
bitgather instruction, 164-165 
DES (Data Encryption Standard), 164 
Rijndael algorithm, 164 
SAG method, 162-165 
shuffling bits, 139-141, 165 
Triple DES, 164 
CSA (carry-save adder) circuit, 90-95 
Cube root, approximate, floating-point, 389 
Cube root, integer, 287-288 
Curves. See also Hilbert's curve. 
Peano, 371-372 
space-filling, 355-372 
Cycling among values, 48-51 


D 


Davio decomposition, 51-53, 56-57 
de Bruijn cycles, 111-112 

de Kloet, David, 55 

De Morgan's laws, 12-13 

DEC PDP-10 computer, xiii, 84 
Decryption. See Cryptography. 

DES (Data Encryption Standard), 164 
Dietz's formula, 19, 55 

difference or zero (doz) function, 41-45 


Distribution of leading digits, 385-387 
Divide and conquer strategy, 81-82 
Division 
arithmetic tables, 455 
doubleword 
from long division, 197-202 
signed, 201-202 
by single word, 192-197 
unsigned, 197-201 
floor, 181-182, 237 
modulus, 181-182, 237 
multiword, 184-188 
of negabinary numbers, 302-304 
nonrestoring algorithm, 192-194 
notation, 181 
overflow detection, 34-36 
plots and graphs, 463-464 
restoring algorithm, 192-193 
shift-and-subtract algorithms (hardware), 192-194 
short, 189-192, 195-197 
signed 
computer, 181 
doubleword, 201-202 
long, 189 
multiword, 188 
short, 190-192 
unsigned 
computer, 181 
doubleword, 197-201 
long, 192-197 
short from signed, 189-192 
Division of integers by constants 
by 3, 207-209, 276-277 
by 5 and 7, 209-210 
exact division 
converting to, 274-275 


definition, 240 
multiplicative inverse, Euclidean algorithm, 242-245 
multiplicative inverse, Newton's method, 245-247 
multiplicative inverse, samples, 247-248 
floor division, 237 
incorporating into a compiler, signed, 220-223 
incorporating into a compiler, unsigned, 232-234 
magic numbers 
Alverson's method, 237-238 
calculating, signed, 212-213, 220-223 
calculating, unsigned, 231-234 
definition, 211 
sample numbers, 238-239 
table lookup, 237 
uniqueness, 224 
magicu algorithm, 232-234 
magicu2 algorithm, 236 
modulus division, 237 
remainder by multiplication and shifting right 
signed, 273-274 
unsigned, 268-272 
remainder by summing digits 
signed, 266-268 
unsigned, 262-266 
signed 
by divisors ≤ -2, 218-220 
by divisors ≥ 2, 210-218 
by powers of 2, 205-206 
incorporating into a compiler, 220-223 
not using mulhs (multiply high signed), 259-262 
remainder by multiplication and shifting right, 273-274 
remainder by summing digits, 266-268 
remainder from powers of 2, 206-207 
test for zero remainder, 250-251 
uniqueness, 224 
timing test, 276 




unsigned 
best programs for, 234-235 
by 3 and 7, 227-229 
by divisors ≥ 1, 230-232 
by powers of 2, 227 
incorporating into a compiler, 232-234 
incremental division and remainder technique, 232-234 


not using mulhu (multiply high unsigned) instruction, 251-259 


remainder by multiplication and shifting right, 268-272 
remainder by summing digits, 262-266 
remainder from powers of 2, 227 
test for zero remainder, 248-250 
Double buffering, 46 
Double-length addition/subtraction, 38-39 
Double-length shifts, 39-40 
Doubleword division 
by single word, 192-197 
from long division, 197-202 
signed, 201-202 
unsigned, 197-201 
Doublewords, definition, 1 
doz (difference or zero) function, 41-45 
Dubé, Danny, 112 


E 


ECCs (error-correcting codes) 
check bits, 332 
code, definition, 343 
code length, 331, 343 
code rate, 343 
code size, 343 
coding theory problem, 345-351 
efficiency, 343 
FEC (binary forward error-correcting block codes), 331 
Gilbert-Varshamov bound, 348-350 


Hamming bound, 348, 350 
Hamming code, 332-342 
converting to SEC-DED code, 334-337 
extended, 334-337 
history of, 335-337 
overview, 332-334 
SEC-DED on 32 information bits, 337-342 
Hamming distance, 95, 343-345 
information bits, 332 
linear codes, 348—349 
overview, 331, 342-343 
perfect codes, 333, 349, 352 
SEC (single error-correcting) codes, 331 


SEC-DED (single error-correcting, double error-detecting) 
codes 


on 32 information bits, 337—342 
check bits, minimum required, 335 
converting from Hamming code, 334-337 
definition, 331 

singleton bound, 352 

sphere-packing bound, 348, 350 

spheres, 347-351 


Encryption. See Cryptography. 

End-around-carry, 38, 56, 304-305 

Error detection, digital data. See CRC (cyclic redundancy check). 
Estimating multiplication overflow, 33-34 

Euclidean algorithm, 242-245 

Euler, Leonhard, 392 

Even parity, 96 

Exact division 


definition, 240 

multiplicative inverse, Euclidean algorithm, 242-245 
multiplicative inverse, Newton's method, 245-247 
multiplicative inverse, samples, 247—248 

overview, 240-242 


Exchanging 


conditionally, 47 
corresponding register fields, 46 
two fields in same register, 47 
two registers, 45-46 
exclusive or 
plots and graphs, 460 
propagating arithmetic bounds through, 77-78 
scan operation on an array of bits, 97 
in three instructions, 17 
Execution time model, 9-10 
Exercise answers. See Answers to exercises. 
Expand operation, 156-157, 159-161 
Exponentiation 
by binary decomposition, 288-290 
in Fortran, 290 
Extended Hamming code, 334-342 
on 32 information bits, 337-342 
Extract, generalized, 150-156 


F 


Factoring, 178 
FEC (binary forward error-correcting block codes), 331 
feedback shift register circuit, 325-326 
Fermat numbers, 391 
FFT (Fast Fourier Transform), 137-139 
find leftmost 0-byte, 117-121 
find rightmost 0-byte, 118-121 
Finding 
decimal digits, 122 
first 0-byte, 117-121 
first uppercase letter, 122 
length of character strings, 117 
next higher number, same number of 1-bits, 14-15 
the nth prime, 391-398, 403 
strings of 1-bits 
first string of a given length, 123-125 


longest string, 125-126 
shortest string, 126-128 
values within arithmetic bounds, 122 
Flipping bits, 135 
Floating-point numbers, 375-389 
distribution of leading digits, 385-387 
formats (single/double), 375-376 
gradual underflow, 376 
IEEE arithmetic standard, 375 
IEEE format, 375-377 
NaN (not a number), 375-376 
normalized, 375-377 
subnormal numbers, 375-377 
table of miscellaneous values, 387-389 
ulp (unit in the last position), 378 
Floating-point operations 
approximate cube root, 389 
approximate reciprocal square root, 383-385 
approximate square root, 389 
comparing using integer operations, 381—382 
conversion table, 378-381 
converting to/from integers, 377-381 
counting leading 0's with, 104-106 
simulating, 107 
Floor division, 181-182, 237 
Floor function, identities, 183, 202-203 
Floyd, R. W., 114 
Formula functions, 398-403 
Formulas for primes, 391-403 
Fortran 
IDIM function, 44 
integer exponentiation, 290 
ISIGN function, 22 
MOD function, 182 
Fractal triangles, plots and graphs, 460 
Full adders, 90 


Full RISC instruction set, 7 
Fundamental theorem of arithmetic, 404 


G 


Gardner, Martin, 315 
Gaudet, Dean, 110 
Gaudet's algorithm, 110 
generalized extract operation, 150-156 
Generalized unshuffle. See SAG (sheep and goats) operation. 
Generator polynomials, CRC codes, 321-323 
Gilbert-Varshamov bound, 348-350 
Golay, M. J. E., 331 
Goryavsky, Julius, 103 
Gosper, R. W. 
iterating through subsets, 14-15 
loop-detection, 114-116 
Gradual underflow, 376 
Graphics-rendering, Hilbert's curve, 372-373 
Graphs. See Plots and graphs. 
Gray, Frank, 315 
Gray code 
applications, 315-317 
balanced, 317 
converting integers to, 97, 312-313 
cyclic, 312 
definition, 311 
history of, 315-317 
incrementing Gray-coded integers, 313-315 
negabinary Gray code, 315 
plots and graphs, 466 
reflected, 311-312, 315 
single track (STGC), 316-317 
Greatest common divisor function, plots and graphs, 464 
GRP instruction, 165 
Hacker, definition, xvi
HAKMEM (hacks memo), xiii 
Half shuffle, 141 
Halfwords, 1 
Hamiltonian paths, 315 
Hamming, R. W., 331 
Hamming bound, 348, 350 
Hamming code 
on 32 information bits, 337-342
converting to SEC-DED code, 334-337 
extended, 334-337 
history of, 335-337 
overview, 332-334 
perfect, 333, 352 
Hamming distance, 95, 343-345 
triangle inequality, 352 
Hardware checksums, 324-326 
Harley, Robert, 90, 101 
Harley's algorithm, 101, 103 
Hexadecimal floating-point, 385 
High-order half of product, 173-174 
Hilbert, David, 355 
Hilbert's curve. See also Space-filling curves. 
applications, 372-373 
coordinates from distance 
curve generator driver program, 359 
description, 358-366 
Lam and Shapiro method, 362-364, 368 
parallel prefix operation, 365-366 
state transition table, 361, 367 
description, 355-356 
distance from coordinates, 366-368 
generating, 356-358 
illustrations, 355, 357 
incrementing coordinates, 368-371 
non-recursive generation, 371 


ray tracing, 372 
three-dimensional analog, 373 
Horner's rule, 49 
IBM
Chipkill technology, 336 
Harvest computer, 336 
PCs, error checking, 336 
PL/I language, 54 
Stretch computer, 81, 336 
System/360 computer, 385 
System/370 computer, 63 
IDIM function, 44 
IEEE arithmetic standard, 375 
IEEE format, floating-point numbers, 375-377 
IEEE Standard for Floating-Point Arithmetic, 375 
Image processing, Hilbert's curve, 372 
Incremental division and remainder technique, 232-234 
Inequalities, logical and arithmetic expressions, 17-18 
Information bits, 332 
Inner perfect shuffle function, plots and graphs, 468-469 
Inner perfect unshuffle function, plots and graphs, 468 
Inner shuffle, 139-141 
insert instruction, 155-156 
Instruction level parallelism, 9 
Instruction set for this book, 5-8 
integer cube root function, 287-288, 297 
Integer exponentiation, 288-290 
integer fourth root function, 297 
integer log base 2 function, 106, 291 
integer log base 10 function, 292-297 
Integer quotient function, plots and graphs, 463 
integer remainder function, 463 
integer square root function, 279-287
Integers. See also specific operations on integers. 


complex, 306-309 
converting to/from floating-point, 377-381 
converting to/from Gray code, 97, 312-313 
reversed, incrementing, 137-139 
reversing, 129-137 
Inverse Gray code function 
formula, 312 
plots and graphs, 466 
An Investigation of the Laws of Thought, 54 
ISIGN (transfer of sign) function, 22 
Iterating through subsets, 14-15 
ITU-TSS (International Telecommunications Union...), 321 
ITU-TSS polynomial, 323 
Knuth, Donald E., 132

Knuth's Algorithm D, 184-188 

Knuth's Algorithm M, 171-172, 174-175 
Knuth's mod operator, 181 

Kronecker, Leopold, 375 


L 


Lam and Shapiro method, 362-364, 368 

Landry, F., 391 

Leading 0's, counting, 99-106. See also nlz (number of leading
zeros) function.


Leading 0’s, detecting, 324. See also CRC (cyclic redundancy 
check). 
Leading digits, distribution, 385-387 
Least common multiple function, plots and graphs, 464 
Linear codes, 348-349
Little-endian format, converting to/from big-endian, 129 
load word byte-reverse (lwbrx) instruction, 118
Logarithms 
binary search method, 292-293 
definition, 291 


log base 2, 106-107, 291 
log base 10, 291-297 
table lookup, 292, 294-297
Logical operations 
with addition and subtraction, 16-17 
and, plots and graphs, 459 
binary, table of, 17 
exclusive or, plots and graphs, 460 
or, plots and graphs, 459 
propagating arithmetic bounds through, 74-76, 78
tight bounds, 74-78
Logical operators on integers, plots and graphs, 459-460 
Long Division, definition, 189 
Loop detection, 114-115 
LRU (least recently used) algorithm, 166-169 
lwbrx (load word byte-reverse) instruction, 118 


M 


MacLisp, 55 

magic algorithm 
incremental division and remainder technique, 232-234 
signed division, 220-223 
unsigned division, 232-234 

Magic numbers 
Alverson's method, 237-238 
calculating, signed, 212-213, 220-223 
calculating, unsigned, 232-234 
calculating, Python code for 
definition, 211 
samples, 238-239 
table lookup, 237 
uniqueness, 224 

magicu algorithm, 232-234 
in Python, 240 

magicu2 algorithm, 236-237 

max function, 41-45 


Mills, W. H., 403 
Mills's theorem, 403-404 
min function, 41-45
MIT PDP-6 Lisp, 55 
MOD function (Fortran), 182 
modu (unsigned modulus) function, 98 
Modulus division, 181-182, 237 
Moore, Eliakim Hastings, 371-372 
mulhs (multiply high signed) instruction 
division with, 207-210, 212, 218, 222, 235 
implementing in software, 173-174 
not using, 259-262 
mulhu (multiply high unsigned) instruction 
division with, 228-229, 234-235, 238 
implementing in software, 173 
not using, 251-259 
Multibyte absolute value, 40-41 
Multibyte addition/subtraction, 40-41 
Multiplication 
arithmetic tables, 454 
of complex numbers, 178-179 
by constants, 175-178 
factoring, 178 
low-order halves independent of signs, 178 
high-order half of 64-bit product, 173-174 
high-order product signed from/to unsigned, 174-175 
multiword, 171-173 
of negabinary numbers, 302 
overflow detection, 31-34 
plots and graphs, 462 
Multiplicative inverse 
Euclidean algorithm, 242-245 
Newton's method, 245-247, 278
samples, 247-248 
multiply instruction, condition codes, 36-37 
Multiword division, 184-189
Multiword multiplication, 171-173
MUX operation in three instructions, 56 
mux (multiplex) instruction, 406 


N 


NAK (negative acknowledgment), 319 
NaN (not a number), 375-376 
Negabinary number system, 299-306 
Gray code, 315 
Negative absolute value, 23-26 
Negative overflow, 30 
Newton-Raphson calculation, 383 
Newton's method, 457-458
integer cube root, 287-288 
integer square root, 279-283 
multiplicative inverse, 245-248 
Next higher number, same number of 1-bits, 14-15 
Nibbles, 1 
nlz (number of leading zeros) function 
applications, 79, 107, 128 
bitsize function, 106-107 
comparison predicates, 23-24, 107 
computing, 99-106 
for counting trailing 0's, 107
finding 0-bytes, 118
finding strings of 1-bits, 123-124 
incrementing reversed integers, 138 
and integer log base 2 function, 106 
rounding to powers of 2, 61 
Nonrestoring algorithm, 192-194 
Normalized numbers, 376 
Notation used in this book, 1-4 
nth prime, finding 
formula functions, 398-401 
Willans's formulas, 393-397 
Wormell's formula, 397-398 


ntz (number of trailing zeros) function 
applications, 114-116
from counting leading 0's, 107
loop detection, 114-115 
ruler function, 114 

Number systems 
base -1 + i, 306-308 
base -1 - i, 308-309 
base -2, 299-306, 315 
most efficient base, 309-310 
negabinary, 299-306, 315 


O


Odd parity, 96 
1-bits, counting. See Counting bits. 
or 

plots and graphs, 459 

in three instructions, 17 
Ordinary arithmetic, 1 
Ordinary rational division, 181 


Outer perfect shuffle bits function, plots and graphs, 469 
Outer perfect shuffle function, plots and graphs, 467 
Outer perfect unshuffle function, plots and graphs, 468 


Outer shuffle, 139-141, 373 
Overflow detection 
definition, 28 
division, 34-36 


estimating multiplication overflow, 33-34 


multiplication, 31-34
negative overflow, 30 
signed add/subtract, 28-30 
unsigned add/subtract, 31 


P 


Parallel prefix operation 
definition, 97 


Hilbert's curve, 364-366
inverse, 116 
parity, 97 
Parallel suffix operation 
compress operation, 150-155 
expand operation, 156-157, 159-161 
generalized extract, 150-156 
inverse, 116 
Parity 
adding to 7-bit quantities, 98 
applications, 98 
computing, 96-98 
definition, 96 
parallel prefix operation, 97 
scan operation, 97 
two-dimensional, 352 
Parity bits, 319-320 
PCs, error checking, 336 
Peano, Giuseppe, 355 
Peano curves, 371-372. See also Hilbert's curve.
Peano-Hilbert curve. See Hilbert's curve. 
Perfect codes, 333, 349 
Perfect shuffle, 139-141, 373 
Permutations on bits, 161-165. See also Bit operations.
Planar curves, 355. See also Hilbert's curve. 
Plots and graphs, 459-469 
addition, 461 
bit reversal function, 467 
compress function, 464-465
division, 463-464
fractal triangles, 460 
Gray code function, 466 
greatest common divisor function, 464 
inner perfect shuffle, 468-469 
inner perfect unshuffle, 468 
integer quotient function, 463 


inverse Gray code function, 466 
least common multiple function, 464 
logical and function, 459 
logical exclusive or function, 460 
logical operators on integers, 459-460 
logical or function, 459 
multiplication, 462 
number of trailing zeros, 466 
outer perfect shuffle, 467-469
outer perfect unshuffle, 468 
population count function, 467 
remainder function, 463 
rotate left function, 465 
ruler function, 466 
SAG (sheep and goats) function, 464-465
self-similar triangles, 460 
Sierpinski triangle, 460 
subtraction, 461 
unary functions, 466-469 
unsigned product of x and y, 462 
Poetry, 278, 287 
population count function. See also Counting bits. 
applications, 95-96 
computing Hamming distance, 95 
counting 1-bits, 81 
counting leading 0's, 101-102
counting trailing 0's, 107-114
plots and graphs, 467 
Position sensors, 315-317 
Powers of 2 
boundary crossings, detecting, 63-64 
rounding to, 59-62, 64 
signed division, 205-206 
unsigned division, 227 
PPERM instruction, 165 
Precision, loss of, 385-386 


Prime numbers 
Fermat numbers, 391 
finding the nth prime 
formula functions, 398-403
Willans's formulas, 393-397
Wormell's formula, 397-398 
formulas for, 391-403 
from polynomials, 392 
Propagating arithmetic bounds 
add and subtract instructions, 70-73
logical operations, 73-78 
signed numbers, 71-73 
through exclusive or, 77-78 
PSHUFB (Shuffle Packed Bytes) instruction, 163 
PSHUFD (Shuffle Packed Doublewords) instruction, 163 
PSHUFW (Shuffle Packed Words) instruction, 163


Q 
Quicksort, 81 


R 


Range analysis, 70 
Ray tracing, Hilbert's curve, 372 
Rearrangements and index transformations, 165-166 
Reed-Muller decomposition, 51-53, 56-57 
Reference matrix method (LRU), 166-169 
Reflected binary Gray code, 311-312, 315 
Registers 
exchanging, 45-46 
exchanging conditionally, 47 
exchanging fields of, 46-47 
reversing contents of, 129-135 
RISC computers, 5 
Reiser, John, 113 
Reiser's algorithm, 113-114 
Remainder function, plots and graphs, 463 


Remainders 
arithmetic tables, 456 
of signed division 
by multiplication and shifting right, 273-274 
by summing digits, 266-268 
from non-powers of 2, 207-210 
from powers of 2, 206-207 
test for zero, 248-251 
of unsigned division 
by multiplication and shifting right, 268-272 
by summing digits, 262-266 
and immediate instruction, 227 
incremental division and remainder technique, 232-234 
test for zero, 248-250
remu function, 119, 135-136 
Residual/residue, 324 
Restoring algorithm, 192-193 
Reversing bits and bytes, 129-137 
6-, 7-, 8-, and 9-bit quantities, 135-137 
32-bit words, 129-135 
big-endian format, converting to little-endian, 129 
definition, 129 
generalized, 135 
load word byte-reverse (lwbrx) instruction, 118
rightmost 16 bits of a word, 130 
with rotate shifts, 129-133
small integers, 135-137 
table lookup, 134 
Riemann hypothesis, 404 
Right justify function, 116 
Rightmost bits, manipulating, 11-12, 15 
De Morgan's laws, 12-13 
right-to-left computability test, 13-14, 55 
Rijndael algorithm, 164 
RISC 
basic instruction set, 5-6 


execution time model, 9-10 
extended mnemonics, 6, 8 
full instruction set, 7-8 
registers, 5-6 
Rotate and sum method, 85-86 
Rotate left function, plots and graphs, 464-465
Rotate shifts, 37-38, 129-133 
Rounding to powers of 2, 59-62, 64 
Ruler function, 114, 466 
Russian decomposition, 51-53, 56-57 
SAG (sheep and goats) operation
description, 162-165
plots and graphs, 464-465

Scan operation, 97 

Seal, David, 90, 110 

Search tree method, 109 

Searching. See Finding. 

SEC (single error-correcting) codes, 331 

SEC-DED (single error-correcting, double error-detecting) codes 
on 32 information bits, 337-342
check bits, minimum required, 335 
converting from Hamming code, 334-335 
definition, 331 

Select instruction, 406 

Self-reproducing program, xvi 

Self-similar triangles, plots and graphs, 460 

shift left double operation, 39 

shift right double signed operation, 39-40 

shift right double unsigned operation, 39 

shift right extended immediate (shrxi) instruction, 228-229 

shift right signed instruction 
alternative to, for sign extension, 19-20 
division by power of 2, 205-206 
from unsigned, 20 


Shift-and-subtract algorithm 
hardware, 192-194 
integer square root, 285-287 
Shifts 
double-length, 39-40 
rotate, 37-38 
Short division, 189-192, 195-196 
Shroeppel's formula, 305-306 
shrxi (shift right extended immediate) instruction, 228-229 
Shuffle Packed Bytes (PSHUFB) instruction, 163
Shuffle Packed Doublewords (PSHUFD) instruction, 163
Shuffle Packed Words (PSHUFW) instruction, 163
Shuffling 
arrays, 165-166 
bits 
half shuffle, 141 
inner perfect shuffle, plots and graphs, 468-469 
inner perfect unshuffle, plots and graphs, 468 
inner shuffle, 139-141 
outer shuffle, 139-141, 373 
perfect shuffle, 139-141 
shuffling bits, 139-141, 165-166 
unshuffling, 140-141, 150, 162, 165-166 
Sierpinski triangle, plots and graphs, 460 
Sign extension, 19-20 
sign function, 20-21. See also three-valued compare function. 
Signed bounds, 78 
Signed comparisons, from unsigned, 25 
Signed computer division, 181-182 
Signed division 
arithmetic tables, 455 
computer, 181 
doubleword, 201-202 
long, 189 
multiword, 188 
short, 190-192 


Signed division of integers by constants 
best programs for, 225-227 
by divisors ≤ -2, 218-220
by divisors ≥ 2, 210-218
by powers of 2, 205-206 
incorporating into a compiler, 220-223 
remainder from non-powers of 2, 207-210 
remainder from powers of 2, 206-207 
test for zero remainder, 250-251 
uniqueness of magic number, 224 
Signed long division, 189 
Signed numbers, propagating arithmetic bounds, 71-73 
Signed short division, 190-192 
signum function, 20-21 
Single error-correcting, double error-detecting (SEC-DED) codes. 
See SEC-DED (single error-correcting, double error- 
detecting) codes. 
Single error-correcting (SEC) codes, 331 
snoob function, 14-15 
Software checksums, 327-329 
Space-filling curves, 371-372. See also Hilbert's curve.
Sparse array indexing, 95 
Sphere-packing bound, 348-350 
Spheres, ECCs (error-correcting codes), 347-350 
Square root, integer 
binary search, 281-285 
hardware algorithm, 285-287 
Newton's method, 279-283 
shift-and-subtract algorithm, 285-287 
Square root, approximate, floating-point, 389 
Square root, approximate reciprocal, floating-point, 383-385 
Stibitz, George, 308 
Strachey, Christopher, 130 
Stretch computer, 81, 336 
Strings. See Bit operations; Character strings. 
strlen (string length) C function, 117 


Subnormal numbers, 376 
Subnorms, 376 
subtract instruction 
condition codes, 36-37 
propagating arithmetic bounds, 70-73 
Subtraction 
arithmetic tables, 453 
difference or zero (doz) function, 41-45 
double-length, 38-39 
combined with logical operations, 16-17 
multibyte, 40-41 
of negabinary numbers, 301-302 
overflow detection, 29-31 
plots and graphs, 461 
Swap-and-complement method, 362-365 
Swapping pointers, 46 
System/360 computer, 385 
System/370 computer, 63 


T 


Table lookup, counting bits, 86-87 
three-valued compare function, 21-22. See also sign function.
Tight bounds 
add and subtract instructions, 70-73
logical operations, 74-79
Timing test, division of integers by constants, 276 
Toggling among values, 48-51 
Tower of Hanoi puzzle, 116, 315 
Trailing zeros. See also ntz (number of trailing zeros) function. 
counting, 107-114 
detecting, 324. See also CRC (cyclic redundancy check). 
plots and graphs, 466 
Transfer of sign (ISIGN) function, 22 
Transposing a bit matrix 
8x8, 141-145 
32 x 32, 145-149 


Triangles 
fractal, 460 
plots and graphs, 460 
self-similar, 460 
Sierpinski, 460 
Triple DES, 164 
True/false comparison results, 23 
Turning off 1-bits, 85 
Ulp (unit in the last position), 378
Unaligned load, 65 
Unary functions, plots and graphs, 466-469 
Uniqueness, of magic numbers, 224 
Unshuffling 

arrays, 162 

bits, 140-141, 162, 468 
Unsigned division 

arithmetic tables, 455 

computer, 181 

doubleword, 197-201 

long, 192-197 

short from signed, 189-192 
Unsigned division of integers by constants 

best programs for, 234-235 

by 3 and 7, 227-229 

by divisors ≥ 1, 230-232

by powers of 2, 227 

incorporating into a compiler, 232-234 

incremental division and remainder technique, 232-234 

remainders, from powers of 2, 227 

test for zero remainder, 248-250 
unsigned modulus (modu) function, 84 
Unsigned product of x and y, plots and graphs, 462 
Uppercase letters, finding, 122 


V 
Voorhies, Douglas, 373 
W 


Willans, C. P., 393 
Willans's formulas, 393-397
Wilson's theorem, 393, 403
Word parity. See Parity. 
Words 
counting bits, 81-87 
definition, 1 
division 
doubleword by single word, 192-197 
Knuth's Algorithm D, 184-188 
multiword, 184-189
signed, multiword, 188 
multiplication, multiword, 171-173 
reversing, 129-134 
searching for 
first 0-byte, 117-121
first uppercase letter, 122 
strings of 1-bits, 123-128 
a value within a range, 122 
word parallel operations, 13 
Wormell, C. P., 397 
Wormell's formula, 397-398 


Z 


zbytel function, 117-121 
zbyter function, 117-121 
Zero means 2^n, 22-23


Footnotes 


Foreword 


1. Why “HAKMEM”? Short for “hacks memo”; one 36-bit PDP-10 word could
hold six 6-bit characters, so a lot of the names PDP-10 hackers worked 
with were limited to six characters. We were used to glancing at a six- 
character abbreviated name and instantly decoding the contractions. So 
naming the memo “HAKMEM” made sense at the time—at least to the 
hackers. 


Preface 


1. One such program, written in C, is: 
main(){char*p="main(){char*p=%c%s%c;(void)printf(p,34,p,34,10);}%c";(void)printf(p,34,p,34,10);}


Chapter 2 


1. A variation of this algorithm appears in [H&S] sec. 7.6.7. 


2. This is useful to get unsigned comparisons in Java, which lacks unsigned 
integers. 
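
One common way to obtain an unsigned comparison from a signed one can be sketched in C as follows. This is a minimal illustration, assuming 32-bit two's-complement integers; the function name `unsigned_less` is hypothetical and does not come from the text.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if x is less than y when both are
   interpreted as unsigned 32-bit values, using only a signed compare.
   Flipping the sign bit of each operand maps the unsigned ordering
   onto the signed ordering. The conversion back to int32_t assumes a
   two's-complement machine. */
int unsigned_less(int32_t x, int32_t y) {
    int32_t fx = (int32_t)((uint32_t)x ^ 0x80000000u);
    int32_t fy = (int32_t)((uint32_t)y ^ 0x80000000u);
    return fx < fy;
}
```

For example, `unsigned_less(1, -1)` is true, because -1 is 0xFFFFFFFF when read as unsigned.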


3. Mathematicians name the operation monus and denote it with ∸. The
terms positive difference and saturated subtraction are also used. 
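
As a small illustration of the operation this note names, here is a sketch for unsigned operands in C. The name `doz_u` is hypothetical, and the branch-free form is only one of several possible formulations.

```c
/* Difference or zero (monus) for unsigned operands:
   returns x - y if x >= y, and 0 otherwise.
   (x >= y) evaluates to 1 or 0; negating it as an unsigned value
   gives an all-1's or all-0's mask that keeps or clears x - y. */
unsigned doz_u(unsigned x, unsigned y) {
    unsigned mask = 0u - (unsigned)(x >= y);
    return (x - y) & mask;
}
```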


4. A destructive operation is one that overwrites one or more of its 
arguments. 


5. Horner’s rule simply factors out x. For example, it evaluates the
fourth-degree polynomial $ax^4 + bx^3 + cx^2 + dx + e$ as
$x(x(x(ax + b) + c) + d) + e$. For a polynomial of degree n it takes n
multiplications and n additions, and it is very suitable for the
multiply-add instruction.
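
Horner's rule as described in this note can be sketched in C. The function name `horner` and the coefficient layout (highest degree first) are assumptions made for this illustration.

```c
/* Evaluates c[0]*x^n + c[1]*x^(n-1) + ... + c[n] by Horner's rule.
   The loop body is one multiply and one add, so a polynomial of
   degree n costs n multiplications and n additions. */
double horner(const double c[], int n, double x) {
    double r = c[0];
    for (int i = 1; i <= n; i++)
        r = r * x + c[i];   /* a natural fit for a multiply-add instruction */
    return r;
}
```

With coefficients {1, 2, 3, 4, 5} and x = 2, this evaluates 16 + 16 + 12 + 8 + 5 = 57.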


6. Logic designers will recognize this as Reed-Muller, a.k.a. positive Davio,
decomposition. According to Knuth [Knu4, 7.1.1], it was known to I. I. 
Zhegalkin [Matematicheskii Sbornik 35 (1928), 311-369]. It is sometimes 
referred to as the Russian decomposition. 


7. The entire 335-page work is available at www.gutenberg.org/etext/15114. 
Chapter 3 

1. pop(x) is the number of 1-bits in x. 

Chapter 4 


1. In the sense of more compact, less branchy, code; faster-running code may
result from checking first for the case of no overflow, assuming the limits
are not likely to be large.


Chapter 5 


1. A full adder is a circuit with three 1-bit inputs (the bits to be added) and 
two 1-bit outputs (the sum and carry). 
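
The full adder of this note can be modeled with word-wide logical operations, which apply the circuit to every bit position at once. This is a sketch; the function name `full_adder` and its pointer-based interface are choices made for this illustration.

```c
/* A full adder applied independently to each bit position of a word.
   sum   is 1 where an odd number of the three inputs are 1;
   carry is 1 where at least two of the three inputs are 1. */
void full_adder(unsigned a, unsigned b, unsigned c,
                unsigned *sum, unsigned *carry) {
    *sum   = a ^ b ^ c;
    *carry = (a & b) | (a & c) | (b & c);
}
```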


2. The flakiness is due to the way C is used. The methods illustrated would be 
perfectly acceptable if coded in machine language, or generated by a 
compiler, for a particular machine. 


Chapter 7 


1. Actually, the first shift left can be omitted, reducing the instruction count to 
126. The quantity mv comes out the same with or without it [Dalton]. 


2. If big-endian bit numbering is used, compress to the left all bits marked 
with 0’s, and to the right all bits marked with 1’s. 


Chapter 8 


1. Reportedly this was known to Gauss. 


Chapter 9 


1. I may be taken to task for this nomenclature, because there is no universal 
agreement that “modulus” implies “nonnegative.” Knuth’s “mod” operator 
[Knu1] is the remainder of floor division, which is negative (or 0) if the
divisor is negative. Several programming languages use “mod” for the 
remainder of truncating division. However, in mathematics, “modulus” is 
sometimes used for the magnitude of a complex number (nonnegative), 
and in congruence theory the modulus is generally assumed to be positive. 
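
The three remainder conventions this note distinguishes can be compared side by side. The sketch below assumes C99 or later, where `%` is the remainder of truncating division; the function names are hypothetical, and the divisor is assumed nonzero and the operands not the most negative integer.

```c
/* Remainder of truncating division: the sign follows the dividend. */
int rem_trunc(int n, int d) { return n % d; }

/* Remainder of floor division: the sign follows the divisor (or is 0). */
int rem_floor(int n, int d) {
    int r = n % d;
    if (r != 0 && ((r < 0) != (d < 0))) r += d;
    return r;
}

/* Remainder of modulus division: always nonnegative. */
int rem_mod(int n, int d) {
    int r = n % d;
    if (r < 0) r += (d < 0) ? -d : d;
    return r;
}
```

For n = 7 and d = -3, the three functions return 1, -2, and 1 respectively; for n = -7 and d = 3 they return -1, 2, and 2.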


2. Some do try. IBM’s PL.8 language uses modulus division, and Knuth’s 
MMIX machine’s division instruction uses floor division [Knu7]. 


3. One execution of the RS/6000’s compare instruction sets multiple status 
bits indicating less than, greater than, or equal. 


4. Actually, the restoring division algorithm can avoid the restoring step by 
putting the result of the subtraction in an additional register and writing 
that register into x only if the result of the subtraction (33 bits) is 
nonnegative. In some implementations this may require an additional 
register and possibly more time. 


Chapter 12 


1. The interested reader might warm up to this challenge. 


2. This is the way it was done at Bell Labs back in 1940 on George Stibitz’s 
Complex Number Calculator [Irvine]. 


Chapter 14 


1. Since renamed the ITU-TSS (International Telecommunications Union— 
Telecommunications Standards Sector). 


Chapter 15 


1. A perfect code exists for $m = 2^k - k - 1$, k an integer; that is,
m = 1, 4, 11, 26, 57, 120, ....
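
The sequence in this note is easy to tabulate. A minimal sketch, with the hypothetical name `perfect_info_bits` and the assumption that k is small enough for the shift to fit in an unsigned int:

```c
/* Number of information bits m = 2^k - k - 1 for k check bits.
   Assumes 1 <= k < 32 so that the shift does not overflow. */
unsigned perfect_info_bits(unsigned k) {
    return (1u << k) - k - 1u;
}
```

Calling it with k = 2, 3, ..., 7 gives 1, 4, 11, 26, 57, 120.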

2. It is also called the “binomial coefficient” because $\binom{n}{r}$ is the
coefficient of the term $x^r y^{n-r}$ in the expansion of the binomial
$(x + y)^n$.


Chapter 16 


1. Recall that a curve is a continuous map from a one-dimensional space to an 
n-dimensional space. 


Chapter 17 


1. This is not officially sanctioned C, but with almost all compilers it works. 


Chapter 18 


1. However, this is the only conjecture of Fermat known to be wrong [Wells]. 


2. Our apologies for the two uses of π in close proximity, but it's standard
notation and shouldn't cause any difficulty. 


3. This is my terminology, not Willans's. 


4. We have slightly simplified his formula. 


Answers To Exercises 


1. Base -2 also has this property, but not base -1 + i. 


2. These formulas were found by the exhaustive expression search program 
Aha! (A Hacker's Assistant). 


Appendix B 


1. Newton's method for the special case of the square root function was 
known to Babylonians about 4,000 years ago. 
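
The square-root special case mentioned in this note is the iteration x ← (x + a/x)/2. A floating-point sketch follows; the function name, the starting guess, and the iteration cap are all choices made for this illustration.

```c
/* Newton's method for the square root (the "Babylonian" iteration):
   repeatedly replace x by the average of x and a/x.
   Returns 0 for a <= 0. The loop stops when the estimate no longer
   changes, or after a fixed cap in case the last bit oscillates. */
double babylonian_sqrt(double a) {
    if (a <= 0.0) return 0.0;
    double x = (a > 1.0) ? a : 1.0;      /* start at or above sqrt(a) */
    for (int i = 0; i < 2000; i++) {
        double next = 0.5 * (x + a / x);
        if (next == x) break;
        x = next;
    }
    return x;
}
```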


unsigned snoob(unsigned x) {
   unsigned smallest, ripple, ones;

   smallest = x & -x;
   ripple = x + smallest;
   ones = x ^ ripple;
   ones = (ones >> 2)/smallest;
   return ripple | ones;
}
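
A possible use of snoob is to step through all n-bit words that have exactly k 1-bits, that is, all k-element subsets of an n-element set. The sketch below repeats the snoob listing so that it is self-contained; the driver `count_subsets` is hypothetical and assumes 1 <= k <= n < 32.

```c
/* Next higher number with the same number of 1-bits (x must be nonzero). */
unsigned snoob(unsigned x) {
    unsigned smallest, ripple, ones;

    smallest = x & -x;               /* isolate the rightmost 1-bit      */
    ripple = x + smallest;           /* ripple a carry through its run   */
    ones = x ^ ripple;               /* bits that changed                */
    ones = (ones >> 2)/smallest;     /* right-adjust the leftover 1-bits */
    return ripple | ones;
}

/* Hypothetical driver: counts the n-bit words with exactly k 1-bits
   by starting from the smallest such word and applying snoob until
   the value no longer fits in n bits. Assumes 1 <= k <= n < 32. */
unsigned count_subsets(unsigned n, unsigned k) {
    unsigned x = (1u << k) - 1u;     /* k 1-bits at the right */
    unsigned limit = 1u << n;
    unsigned count = 0;

    while (x < limit) {
        count++;
        x = snoob(x);
    }
    return count;
}
```

The count returned equals the binomial coefficient "n choose k"; for n = 5 and k = 2 it is 10.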


unsigned x, y, z, m, n, t;

m = nlz(x);
n = nlz(y);
if (m + n <= 30) goto overflow;
t = x*(y >> 1);
if ((int)t < 0) goto overflow;
z = t*2;
if (y & 1) {
   z = z + x;
   if (z < x) goto overflow;
}
// z is the correct product of x and y.


subf  R5,Ry,Rx    # R5 <-- Rx - Ry.
subfc R6,Rx,Ry    # R6 <-- Ry - Rx, set carry.
subfe R7,Ry,Rx    # R7 <-- Rx - Ry + carry, set carry.
subfe R8,R7,R5    # R8 <-- R5 - R7 + carry, (set carry).


shldi RT,RA,RB,I 
shrdi RT,RA,RB,I 


shldi RT,Rx,Rx,I 


sub   z,x,y       Set z = x - y.
cmplt t,x,y       Set t = 1 if x < y, else 0.
movne z,t,r0      Set z = 0 if x < y.


cmplt t,x,y       Set t = 1 if x < y, else 0.
movne x,t,y       Set x = y if x < y.


sub eax,ecx    ; Inputs x and y are in eax and ecx resp.
sbb edx,edx    ; edx = 0 if x >= y, else -1.
and eax,edx    ; 0 if x >= y, else x - y.
add eax,ecx    ; Add y, giving y if x >= y, else x.


if (x == a) x = b;
else x = a;


if (x == a) x = b;
else if (x == b) x = c;
else x = a;


unsigned flp2(unsigned x) {
   x = x | (x >> 1);
   x = x | (x >> 2);
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >> 16);
   return x - (x >> 1);
}


y = 0x80000000;
while (y > x)
   y = y >> 1;
return y;


do {
   y = x;
   x = x & (x - 1);
} while(x != 0);
return y;


y = 1;
while (y < x)     // Unsigned comparison.
   y = 2*y;
return y;


unsigned clp2(unsigned x) {
   x = x - 1;
   x = x | (x >> 1);
   x = x | (x >> 2);
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >> 16);
   return x + 1;
}


O   RA,=A(-4096)
ALR RA,RL 
BO CROSSES 


s = a + c;
t = b + d;
u = a & c & ~s & ~(b & d & ~t);
v = ((a ^ c) | ~(a ^ s)) & (~b & ~d & t);
if ((u | v) < 0) {
   s = 0x80000000;
   t = 0x7FFFFFFF;}


m = 0x80000000 >> nlz(a ^ c); 


unsigned minOR(unsigned a, unsigned b,
               unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (~a & c & m) {
         temp = (a | m) & -m;
         if (temp <= b) {a = temp; break;}
      }
      else if (a & ~c & m) {
         temp = (c | m) & -m;
         if (temp <= d) {c = temp; break;}
      }
      m = m >> 1;
   }
   return a | c;
}


m = 0x80000000 >> nlz(b & d); 


unsigned maxOR(unsigned a, unsigned b,
               unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (b & d & m) {
         temp = (b - m) | (m - 1);
         if (temp >= a) {b = temp; break;}
         temp = (d - m) | (m - 1);
         if (temp >= c) {d = temp; break;}
      }
      m = m >> 1;
   }
   return b | d;
}


unsigned minAND(unsigned a, unsigned b,
                unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (~a & ~c & m) {
         temp = (a | m) & -m;
         if (temp <= b) {a = temp; break;}
         temp = (c | m) & -m;
         if (temp <= d) {c = temp; break;}
      }
      m = m >> 1;
   }
   return a & c;
}


unsigned maxAND(unsigned a, unsigned b,
                unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (b & ~d & m) {
         temp = (b & ~m) | (m - 1);
         if (temp >= a) {b = temp; break;}
      }
      else if (~b & d & m) {
         temp = (d & ~m) | (m - 1);
         if (temp >= c) {d = temp; break;}
      }
      m = m >> 1;
   }
   return b & d;
}


temp = (b - m)
if (temp >= a)
else (
temp = (d -
if (temp >=


x = (x & 0x55555555) + ((x >>  1) & 0x55555555);
x = (x & 0x33333333) + ((x >>  2) & 0x33333333);
x = (x & 0x0F0F0F0F) + ((x >>  4) & 0x0F0F0F0F);
x = (x & 0x00FF00FF) + ((x >>  8) & 0x00FF00FF);
x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF);


int pop(unsigned x) {
   x = x - ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   x = (x + (x >> 4)) & 0x0F0F0F0F;
   x = x + (x >> 8);
   x = x + (x >> 16);
   return x & 0x0000003F;
}


x = x - 3*((x >> 2) & 0x33333333)


int pop(unsigned x) {
   unsigned n;

   n = (x >> 1) & 033333333333;       // Count bits in
   x = x - n;                         // each 3-bit
   n = (n >> 1) & 033333333333;       // field.
   x = x - n;
   x = (x + (x >> 3)) & 030707070707; // 6-bit sums.
   return x%63;                       // Add 6-bit sums.
}


return ((x * 0404040404) >> 26) + // Add 6-bit sums. 
(x >> 30); 


int pop(unsigned x) {
   unsigned n;

   n = (x >> 1) & 0x77777777;        // Count bits in
   x = x - n;                        // each 4-bit
   n = (n >> 1) & 0x77777777;        // field.
   x = x - n;
   n = (n >> 1) & 0x77777777;
   x = x - n;
   x = (x + (x >> 4)) & 0x0F0F0F0F;  // Get byte sums.
   x = x*0x01010101;                 // Add the bytes.
   return x >> 24;
}


int pop(unsigned x) {
   int n;

   n = 0;
   while (x != 0) {
      n = n + 1;
      x = x & (x - 1);
   }
   return n;
}


int pop(unsigned x) {            // Table lookup.
   static char table[256] = {
      0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
      ...
      4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};

   return table[x         & 0xFF] +
          table[(x >>  8) & 0xFF] +
          table[(x >> 16) & 0xFF] +
          table[(x >> 24)];
}


int pop(unsigned x) {
   int i, sum;

   // Rotate and sum method        // Shift right & subtract

   sum = x;                        // sum = x;
   for (i = 1; i <= 31; i++) {     // while (x != 0) {
      x = rotatel(x, 1);           //    x = x >> 1;
      sum = sum + x;               //    sum = sum - x;
   }                               // }
   return -sum;                    // return sum;
}


x = x * 0x08040201;    // Make 4 copies.
x = x >> 3;            // So next step hits proper bits.
x = x & 0x11111111;    // Every 4th bit.
x = x * 0x11111111;    // Sum the digits (each 0 or 1).
x = x >> 28;           // Position the result.


x = x * 0x02040810;    // Make 4 copies, left-adjusted.
x = x & 0x11111111;    // Every 4th bit.
x = x * 0x11111111;    // Sum the digits (each 0 or 1).
x = x >> 28;           // Position the result.


int pop(unsigned x) {
   unsigned long long y;
   y = x * 0x0002000400080010ULL;
   y = y & 0x1111111111111111ULL;
   y = y * 0x1111111111111111ULL;
   y = y >> 60;
   return y;
}


int popDiff(unsigned x, unsigned y) {
   x = x - ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   y = ~y;
   y = y - ((y >> 1) & 0x55555555);
   y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
   x = x + y;
   x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
   x = x + (x >> 8);
   x = x + (x >> 16);
   return (x & 0x0000007F) - 32;
}


int popCmpr(unsigned xp, unsigned yp) {
   unsigned x, y;
   x = xp & ~yp;                // Clear bits where
   y = yp & ~xp;                // both are 1.
   while (1) {
      if (x == 0) return y | -y;
      if (y == 0) return 1;
      x = x & (x - 1);          // Clear one bit
      y = y & (y - 1);          // from each.
   }
}


u = ones ^ A[i]; 
v = A[i+1];
twos = (ones & A[i]) | (u & v); 


ones = u ^ v; 


#define CSA(h,l, a,b,c) \
   {unsigned u = a ^ b; unsigned v = c; \
    h = (a & b) | (u & v); l = u ^ v;}


int popArray(unsigned A[], int n) {

   int tot, i;
   unsigned ones, twos;

   tot = 0;                          // Initialize.
   ones = 0;

   for (i = 0; i <= n - 2; i = i + 2) {
      CSA(twos, ones, ones, A[i], A[i+1])
      tot = tot + pop(twos);
   }
   tot = 2*tot + pop(ones);

   if (n & 1)                        // If there's a last one,
      tot = tot + pop(A[i]);         // add it in.

   return tot;
}


int popArray(unsigned A[], int n) {

   int tot, i;
   unsigned ones, twos, twosA, twosB,
      fours, foursA, foursB, eights;

   tot = 0;                          // Initialize.
   fours = twos = ones = 0;

   for (i = 0; i <= n - 8; i = i + 8) {
      CSA(twosA, ones, ones, A[i], A[i+1])
      CSA(twosB, ones, ones, A[i+2], A[i+3])
      CSA(foursA, twos, twos, twosA, twosB)
      CSA(twosA, ones, ones, A[i+4], A[i+5])
      CSA(twosB, ones, ones, A[i+6], A[i+7])
      CSA(foursB, twos, twos, twosA, twosB)
      CSA(eights, fours, fours, foursA, foursB)
      tot = tot + pop(eights);
   }
   tot = 8*tot + 4*pop(fours) + 2*pop(twos) + pop(ones);

   for (i = i; i < n; i++)           // Simply add in the last
      tot = tot + pop(A[i]);         // 0 to 7 elements.
   return tot;
}


j = i >> 5;                 // j = i/32.
k = i & 31;                 // k = rem(i, 32);
mask = 1 << k;              // A "1" at position k.
if ((bits[j] & mask) == 0) goto no_such_element;
mask = mask - 1;            // 1's to right of k.
sparse_i = bitsum[j] + pop(bits[j] & mask);


y = y ^ (y >> 2);
y = y ^ (y >> 1);


y = 0x6996 >> (y & 0xF);


x = x ^ (x >> 1);
x = (x ^ (x >> 2)) & 0x11111111;
x = x*0x11111111;
p = (x >> 28) & 1;


if (x == 0) return(32);
n = 0;
if (x <= 0x0000FFFF) {n = n +16; x = x <<16;}
if (x <= 0x00FFFFFF) {n = n + 8; x = x << 8;}
if (x <= 0x0FFFFFFF) {n = n + 4; x = x << 4;}
if (x <= 0x3FFFFFFF) {n = n + 2; x = x << 2;}
if (x <= 0x7FFFFFFF) {n = n + 1;}
return n;


if ((x & 0xFFFF0000) == 0) {n = n +16; x = x <<16;}
if ((x & 0xFF000000) == 0) {n = n + 8; x = x << 8;}


n = n + 1 - (x >> 31);


if ((int)x <= 0) return (~x >> 26) & 32;


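int nlz(unsigned x) {
   int n;

   if (x == 0) return(32);
   n = 1;
   if ((x >> 16) == 0) {n = n +16; x = x <<16;}
   if ((x >> 24) == 0) {n = n + 8; x = x << 8;}
   if ((x >> 28) == 0) {n = n + 4; x = x << 4;}
   if ((x >> 30) == 0) {n = n + 2; x = x << 2;}
   n = n - (x >> 31);
   return n;
}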


static char table[256] = {0,1,2,2,3,3,3,3,4,4,...,8};
return n - table[x]; 


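int nlz(unsigned x) {
   unsigned y;
   int n;

   n = 32;
   y = x >>16;  if (y != 0) {n = n -16;  x = y;}
   y = x >> 8;  if (y != 0) {n = n - 8;  x = y;}
   y = x >> 4;  if (y != 0) {n = n - 4;  x = y;}
   y = x >> 2;  if (y != 0) {n = n - 2;  x = y;}
   y = x >> 1;  if (y != 0) return n - 2;
   return n - x;
}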

int nlz(unsigned x) {
   unsigned y;
   int n, c;

   n = 32;
   c = 16;
   do {
      y = x >> c;  if (y != 0) {n = n - c;  x = y;}
      c = c >> 1;
   } while (c != 0);
   return n - x;
}


int nlz(int x) {
   int y, n;

   n = 0;
   y = x;
L: if (x < 0) return n;
   if (y == 0) return 32 - n;
   n = n + 1;
   x = x << 1;
   y = y >> 1;
   goto L;
}


int nlz(unsigned x) {
   int y, m, n;

   y = -(x >> 16);      // If left half of x is 0,
   m = (y >> 16) & 16;  // set n = 16.  If left half
   n = 16 - m;          // is nonzero, set n = 0 and
   x = x >> m;          // shift x right 16.
                        // Now x is of the form 0000xxxx.
   y = x - 0x100;       // If positions 8-15 are 0,
   m = (y >> 16) & 8;   // add 8 to n and shift x left 8.
   n = n + m;
   x = x << m;

   y = x - 0x1000;      // If positions 12-15 are 0,
   m = (y >> 16) & 4;   // add 4 to n and shift x left 4.
   n = n + m;
   x = x << m;

   y = x - 0x4000;      // If positions 14-15 are 0,
   m = (y >> 16) & 2;   // add 2 to n and shift x left 2.
   n = n + m;
   x = x << m;

   y = x >> 14;         // Set y = 0, 1, 2, or 3.
   m = y & ~(y >> 1);   // Set m = 0, 1, 2, or 2 resp.
   return n + 2 - m;
}

int nlz(unsigned x) {
   int pop(unsigned x);

   x = x | (x >> 1);
   x = x | (x >> 2);
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >>16);
   return pop(~x);
}


int nlz(unsigned x) {

   static char table[64] =
     {32,31, u,16, u,30, 3, u,  15, u, u, u,29,10, 2, u,
       u, u,12,14,21, u,19, u,   u,28, u,25, u, 9, 1, u,
      17, u, 4, u, u, u,11, u,  13,22,20, u,26, u, u,18,
       5, u, u,23, u,27, u, 6,   u,24, 7, u, 8, u, 0, u};

   x = x | (x >> 1);    // Propagate leftmost
   x = x | (x >> 2);    // 1-bit to the right.
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >>16);
   x = x*0x06EB14F9;    // Multiplier is 7*255**3.
   return table[x >> 26];
}


   x = (x << 3) - x;    // Multiply by 7.
   x = (x << 8) - x;    // Multiply by 255.
   x = (x << 8) - x;    // Again.
   x = (x << 8) - x;    // Again.


   static char table[64] =
     {32,20,19, u, u,18, u, 7,  10,17, u, u,14, u, 6, u,
       u, 9, u,16, u, u, 1,26,   u,13, u, u,24, 5, u, u,
       u,21, u, 8,11, u,15, u,   u, u, u, 2,27, 0,25, u,
      22, u,12, u, u, 3,28, u,  23, u, 4,29, u, u,30,31};


   x = x & ~(x >> 16);
   x = x*0xFD7049FF;


int nlz(unsigned k) {
   union {
      unsigned asInt[2];
      double asDouble;
   };
   int n;

   asDouble = (double)k + 0.5;
   n = 1054 - (asInt[LE] >> 20);
   return n;
}


xx = (double)k + 0.5; 
n = 1054 - (*((unsigned *)&xx + LE) >> 20); 


   asDouble = (double)k;
   n = 1054 - (asInt[LE] >> 20);
   n = (n & 31) + (n >> 9);


   k = k & ~(k >> 1);
   asFloat = (float)k + 0.5f;
   n = 158 - (asInt >> 23);


   k = k & ~(k >> 1);
   asFloat = (float)k;
   n = 158 - (asInt >> 23);
   n = (n & 31) + (n >> 6);


#define NLZ(kp) \
   ({union {unsigned _asInt; float _asFloat;}; \
     unsigned _k = (kp), _kk = _k & ~(_k >> 1); \
     _asFloat = (float)_kk + 0.5f; \
     158 - (_asInt >> 23);})


   x = x ^ (x >> 31);   // If (x < 0) x = -x - 1;
   return 33 - nlz(x);


   32 - nlz(x ^ (x << 1))


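int ntz(unsigned x) {
   int n;

   if (x == 0) return(32);
   n = 1;
   if ((x & 0x0000FFFF) == 0) {n = n +16; x = x >>16;}
   if ((x & 0x000000FF) == 0) {n = n + 8; x = x >> 8;}
   if ((x & 0x0000000F) == 0) {n = n + 4; x = x >> 4;}
   if ((x & 0x00000003) == 0) {n = n + 2; x = x >> 2;}
   return n - (x & 1);
}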


((x << 1) >> 31);


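int ntz(unsigned x) {
   unsigned y;
   int n;

   if (x == 0) return 32;
   n = 31;
   y = x <<16;  if (y != 0) {n = n -16;  x = y;}
   y = x << 8;  if (y != 0) {n = n - 8;  x = y;}
   y = x << 4;  if (y != 0) {n = n - 4;  x = y;}
   y = x << 2;  if (y != 0) {n = n - 2;  x = y;}
   y = x << 1;  if (y != 0) {n = n - 1;}
   return n;
}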


int ntz(char x) {
   if (x & 15) {
      if (x & 3) {
         if (x & 1) return 0;
         else return 1;
      }
      else if (x & 4) return 2;
      else return 3;
   }
   else if (x & 0x30) {
      if (x & 0x10) return 4;
      else return 5;
   }
   else if (x & 0x40) return 6;
   else if (x) return 7;
   else return 8;
}


int ntz(unsigned x) {
   int n;

   x = ~x & (x - 1);
   n = 0;                       // n = 32;
   while (x != 0) {             // while (x != 0) {
      n = n + 1;                //    n = n - 1;
      x = x >> 1;               //    x = x + x;
   }                            // }
   return n;                    // return n;
}


int ntz(unsigned x) {
   unsigned y, bz, b4, b3, b2, b1, b0;

   y = x & -x;                  // Isolate rightmost 1-bit.
   bz = y ? 0 : 1;              // 1 if y = 0.
   b4 = (y & 0x0000FFFF) ? 0 : 16;
   b3 = (y & 0x00FF00FF) ? 0 :  8;
   b2 = (y & 0x0F0F0F0F) ? 0 :  4;
   b1 = (y & 0x33333333) ? 0 :  2;
   b0 = (y & 0x55555555) ? 0 :  1;
   return bz + b4 + b3 + b2 + b1 + b0;
}


   x = (x << 4) + x;    // x = x*17.
   x = (x << 6) + x;    // x = x*65.
   x = (x << 16) - x;   // x = x*65535.


int ntz(unsigned x) {

   static char table[64] =
     {32, 0, 1,12, 2, 6, u,13,   3, u, 7, u, u, u, u,14,
      10, 4, u, u, 8, u, u,25,   u, u, u, u, u,21,27,15,
      31,11, 5, u, u, u, u, u,   9, u, u,24, u, u,20,26,
      30, u, u, u, u,23, u,19,  29, u,22,18,28,17,16, u};

   x = (x & -x)*0x0450FBAF;
   return table[x >> 26];
}


0000 0100 1101 0111 0110 0101 0001 1111. 


return table[x >> 27] + 32*(x == 0); 


int ntz(unsigned x) {

   static char table[32] =
     { 0, 1, 2,24, 3,19, 6,25,  22, 4,20,10,16, 7,12,26,
      31,23,18, 5,21, 9,15,11,  30,17, 8,14,29,13,28,27};

   if (x == 0) return 32;
   x = (x & -x)*0x04D7651F;
   return table[x >> 27];
}
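
The de Bruijn lookup above is easy to check independently. The following is a minimal sketch, assuming 32-bit words; the name ntz32 is illustrative and not from the text. The isolated rightmost 1-bit is a power of two, so the multiplication is a left shift of the constant 0x04D7651F, and the top five bits of the product select a distinct table entry for each shift amount.

```c
#include <stdint.h>

/* Number of trailing zeros via a de Bruijn multiply and a 32-entry table. */
int ntz32(uint32_t x) {
    static const unsigned char table[32] = {
         0, 1, 2,24, 3,19, 6,25,  22, 4,20,10,16, 7,12,26,
        31,23,18, 5,21, 9,15,11,  30,17, 8,14,29,13,28,27};

    if (x == 0) return 32;
    x = (uint32_t)((x & (0u - x)) * 0x04D7651Fu);  /* isolate lowest 1-bit, then "shift" */
    return table[x >> 27];
}
```

For example, ntz32(8) isolates bit 3, the product's top five bits are 00100, and table[4] holds 3.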


int ntz(unsigned x) {

   static char table[37] = {32,  0,  1, 26,  2, 23, 27,
                 u,  3, 16, 24, 30, 28, 11,  u, 13,  4,
                 7, 17,  u, 25, 22, 31, 15, 29, 10, 12,
                 6,  u, 21, 14,  9,  5, 20,  8, 19, 18};

   x = (x & -x)%37;
   return table[x];
}


void ld_Gosper(int (*f)(int), int X0, int *mu_l,
                              int *mu_u, int *lambda) {
   int Xn, k, m, kmax, n, lgl;
   int T[33];

   T[0] = X0;
   Xn = X0;
   for (n = 1; ; n++) {
      Xn = f(Xn);
      kmax = 31 - nlz(n);           // Floor(log2 n).
      for (k = 0; k <= kmax; k++) {
         if (Xn == T[k]) goto L;
      }
      T[ntz(n+1)] = Xn;             // No match.
   }
L:
   // Compute m = max{i | i < n and ntz(i+1) = k}.

   m = ((((n >> k) - 1) | 1) << k) - 1;
   *lambda = n - m;
   lgl = 31 - nlz(*lambda - 1); // Ceil(log2 lambda) - 1.
   *mu_u = m;                    // Upper bound on mu.
   *mu_l = m - max(1, 1 << lgl) + 1;// Lower bound on mu.
}


int zbytel(unsigned x) {
   if      ((x >> 24) == 0)         return 0;
   else if ((x & 0x00FF0000) == 0)  return 1;
   else if ((x & 0x0000FF00) == 0)  return 2;
   else if ((x & 0x000000FF) == 0)  return 3;
   else return 4;
}


int zbytel(unsigned x) {
   unsigned y;
   int n;
                        // Original byte: 00 80 other
   y = (x & 0x7F7F7F7F) + 0x7F7F7F7F;  // 7F 7F 1xxxxxxx
   y = ~(y | x | 0x7F7F7F7F);          // 80 00 00000000
   n = nlz(y) >> 3;             // n = 0 ... 4, 4 if x
   return n;                    // has no 0-byte.
}


   n = (32 - nlz(~y & (y - 1))) >> 3;


y = (x - 0x01010101) & ~x & 0x80808080; 
n = ntz(y) >> 3; 


static char table[16] 


return table[y%127]; 


                        // Original byte: 00 80 other
   y = (x & 0x7F7F7F7F) + 0x7F7F7F7F;  // 7F 7F 1xxxxxxx
   y = ~(y | x | 0x7F7F7F7F);          // 80 00 00000000
                                       // These steps map:
   if (y == 0) return 4;               // 00000000 ==> 4,
   else if (y > 0x0000FFFF)            // 80xxxxxx ==> 0,
      return (y >> 31) ^ 1;            // 0080xxxx ==> 1,
   else                                // 000080xx ==> 2,
      return (y >> 15) ^ 3;            // 00000080 ==> 3.


return table[hopu(y, 0x02040810) & 15]; 
   return table[y*0x00204081 >> 28];


   return table[y%511];


   y = (x & 0x7F7F7F7F) + 0x7F7F7F7F;
   y = y | x;           // Leading 1 on nonzero bytes.
   t1 = y >> 31;               // t1 = a.
   t2 = (y >> 23) & t1;        // t2 = ab.
   t3 = (y >> 15) & t2;        // t3 = abc.
   t4 = (y >>  7) & t3;        // t4 = abcd.
   return t1 + t2 + t3 + t4;


   y = (x & 0x7F7F7F7F) + 0x76767676;
   y = y | x;
   y = y | 0x7F7F7F7F;         // Bytes > 9 are 0xFF.
   y = ~y;                     // Bytes > 9 are 0x00,
                               // bytes <= 9 are 0x80.
   n = nlz(y) >> 3;


   d = (x | 0x80808080) - 0x41414141;
   d = ~((x | 0x7F7F7F7F) ^ d);
   y = (d & 0x7F7F7F7F) + 0x66666666;
   y = y | d;
   y = y | 0x7F7F7F7F;     // Bytes not from 41-5A are FF.
   y = ~y;                 // Bytes not from 41-5A are 00,
                           // bytes from 41-5A are 80.
   n = nlz(y) >> 3;


int ffstr1(unsigned x, int n) {
   int k, p;

   p = 0;               // Initialize position to return.
   while (x != 0) {
      k = nlz(x);       // Skip over initial 0's
      x = x << k;       // (if any).
      p = p + k;
      k = nlz(~x);      // Count first/next group of 1's.
      if (k >= n)       // If enough,
         return p;      // return.
      x = x << k;       // Not enough 1's, skip over
      p = p + k;        // them.
   }
   return 32;
}


int ffstr1(unsigned x, int n) {
   int s;

   while (n > 1) {
      s = n >> 1;
      x = x & (x << s);
      n = n - s;
   }
   return nlz(x);
}


int maxstr1(unsigned x) {
   int k;
   for (k = 0; x != 0; k++) x = x & 2*x;
   return k;
}


x   = 0011 1111 1111 0011 1111 0011 1111 1000
x2  = 0011 1111 1110 0011 1110 0011 1111 0000
x4  = 0011 1111 1000 0011 1000 0011 1100 0000
x8  = 0011 1000 0000 0000 0000 0000 0000 0000
x16 = all 0's


int fmaxstr1(unsigned x, int *apos) {
   unsigned y;
   int s;

   if (x == 0) {*apos = 32; return 0;}
   y = x & (x << 1);
   if (y == 0) {s = 1; goto L1;}
   x = y & (y << 2);
   if (x == 0) {s = 2; x = y; goto L2;}
   y = x & (x << 4);
   if (y == 0) {s = 4; goto L4;}
   x = y & (y << 8);
   if (x == 0) {s = 8; x = y; goto L8;}
   if (x == 0xFFFF8000) {*apos = 0; return 32;}
   s = 16;

L16: y = x & (x << 8);
     if (y != 0) {s = s + 8; x = y;}
L8:  y = x & (x << 4);
     if (y != 0) {s = s + 4; x = y;}
L4:  y = x & (x << 2);
     if (y != 0) {s = s + 2; x = y;}
L2:  y = x & (x << 1);
     if (y != 0) {s = s + 1; x = y;}
L1:  *apos = nlz(x);
     return s;
}


x = 0011 1111 1111 0011 1111 0011 1111 1000
b = 0010 0000 0000 0010 0000 0010 0000 0000
e = 0000 0000 0001 0000 0001 0000 0000 1000


int fminstr1(unsigned x, int *apos) {
   int k;
   unsigned b, e;       // Beginnings, ends.

   if (x == 0) {*apos = 32; return 0;}
   b = ~(x >> 1) & x;   // 0-1 transitions.
   e = x & ~(x << 1);   // 1-0 transitions.
   for (k = 1; (b & e) == 0; k++)
      e = e << 1;
   *apos = nlz(b & e);
   return k;
}


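x = (x & 0x55555555) <<  1 | (x & 0xAAAAAAAA) >>  1;
x = (x & 0x33333333) <<  2 | (x & 0xCCCCCCCC) >>  2;
x = (x & 0x0F0F0F0F) <<  4 | (x & 0xF0F0F0F0) >>  4;
x = (x & 0x00FF00FF) <<  8 | (x & 0xFF00FF00) >>  8;
x = (x & 0x0000FFFF) << 16 | (x & 0xFFFF0000) >> 16;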


unsigned rev(unsigned x) {
   x = (x & 0x55555555) <<  1 | (x >>  1) & 0x55555555;
   x = (x & 0x33333333) <<  2 | (x >>  2) & 0x33333333;
   x = (x & 0x0F0F0F0F) <<  4 | (x >>  4) & 0x0F0F0F0F;
   x = (x << 24) | ((x & 0xFF00) << 8) |
       ((x >> 8) & 0xFF00) | (x >> 24);
   return x;
}
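
The swap-based reversal above can be checked with a short stand-alone sketch. It assumes 32-bit words, and the name rev32 is illustrative rather than taken from the text. Adjacent bits are swapped first, then 2-bit fields, then nibbles, and the final statement reverses the four bytes.

```c
#include <stdint.h>

/* Reverse the bits of a 32-bit word by swapping progressively wider fields. */
uint32_t rev32(uint32_t x) {
    x = (x & 0x55555555u) <<  1 | ((x >>  1) & 0x55555555u);  /* single bits */
    x = (x & 0x33333333u) <<  2 | ((x >>  2) & 0x33333333u);  /* bit pairs   */
    x = (x & 0x0F0F0F0Fu) <<  4 | ((x >>  4) & 0x0F0F0F0Fu);  /* nibbles     */
    x = (x << 24) | ((x & 0xFF00u) << 8) |
        ((x >> 8) & 0xFF00u) | (x >> 24);                      /* bytes       */
    return x;
}
```

Applying rev32 twice returns the original word, which makes a convenient self-check.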


Given         0000 0000 0000 0000 abcd efgh ijkl mnop
After step 1  0000 0000 ijkl mnop abcd efgh .... ....
After step 2  0000 mnop ijkl efgh abcd .... .... ....
After step 3  00op mnkl ijgh efcd ab.. .... .... ....
After step 4  0pon mlkj ihgf edcb a... .... .... ....
After step 5  ponm lkji hgfe dcba .... .... .... ....


x = x | ((x & 0x000000FF) << 16);
x = (x & 0xF0F0F0F0) | ((x & 0x0F0F0F0F) << 8);
x = (x & 0xCCCCCCCC) | ((x & 0x33333333) << 4);
x = (x & 0xAAAAAAAA) | ((x & 0x55555555) << 2);
x = x << 1;


x = shlr(x & 0x00FF00FF, 16) | x & ~0x00FF00FF;
x = shlr(x & 0x0F0F0F0F,  8) | x & ~0x0F0F0F0F;
x = shlr(x & 0x33333333,  4) | x & ~0x33333333;
x = shlr(x & 0x55555555,  2) | x & ~0x55555555;
x = shlr(x, 1);


x = shlr(x, 16) & 0x00FF00FF | x & ~0x00FF00FF;


x = ((shlr(x, 16) ^ x) & 0x00FF00FF) ^ x;


unsigned rev(unsigned x) {
   unsigned t;
   t = x & 0x00FF00FF; x = shlr(t, 16) | t ^ x;
   t = x & 0x0F0F0F0F; x = shlr(t,  8) | t ^ x;
   t = x & 0x33333333; x = shlr(t,  4) | t ^ x;
   t = x & 0x55555555; x = shlr(t,  2) | t ^ x;
   x = shlr(x, 1);
   return x;
}


012345678 9abcdefgh ijklmnopq    The given 27-bit word
ijklmnopq 9abcdefgh 012345678    First ternary swap
opqlmnijk fghcde9ab 678345012    Second ternary swap
qponmlkji hgfedcba9 876543210    Third ternary swap


x = (x & 0x000001FF) << 18 | (x & 0x0003FE00) |
    (x >> 18) & 0x000001FF;
x = (x & 0x001C0E07) << 6 | (x & 0x00E07038) |
    (x >> 6) & 0x001C0E07;
x = (x & 0x01249249) << 2 | (x & 0x02492492) |
    (x >> 2) & 0x01249249;


01234567 89abcdef ghijklmn opqrstuv    Given
fghijklm nopqrstu v0123456 789abcde    Rotate left 15
pqrstuvm nofghijk labcde56 78901234    10-swap
tuvspqrm nojklifg hebcda96 78541230    4-swap
vutsrqpo nmlkjihg fedcba98 76543210    2-swap


x = shlr(x, 15);                // Rotate left 15.
x = (x & 0x003F801F) << 10 | (x & 0x01C003E0) |
    (x >> 10) & 0x003F801F;
x = (x & 0x0E038421) << 4 | (x & 0x11C439CE) |
    (x >> 4) & 0x0E038421;
x = (x & 0x22488842) << 2 | (x & 0x549556B5) |
    (x >> 2) & 0x22488842;


unsigned rev(unsigned x) {
   unsigned t;

   x = shlr(x, 15);                // Rotate left 15.
   t = (x ^ (x>>10)) & 0x003F801F; x = (t | (t<<10)) ^ x;
   t = (x ^ (x>> 4)) & 0x0E038421; x = (t | (t<< 4)) ^ x;
   t = (x ^ (x>> 2)) & 0x22488842; x = (t | (t<< 2)) ^ x;
   return x;
}


x = (x & M1) << s | (x & M2) | (x >> s) & M1;


t = (x ^ (x >> s)) & M1; x = (t | (t << s)) ^ x;


x = (x << 15) | (x >> 17); // Rotate left 15. 


unsigned long long rev(unsigned long long x) {
   unsigned long long t;

   x = (x << 31) | (x >> 33);   // I.e., shlr(x, 31).
   t = (x ^ (x >> 20)) & 0x00000FFF800007FFLL;
   x = (t | (t << 20)) ^ x;
   t = (x ^ (x >> 8)) & 0x00F8000F80700807LL;
   x = (t | (t << 8)) ^ x;
   t = (x ^ (x >> 4)) & 0x0808708080807008LL;
   x = (t | (t << 4)) ^ x;
   t = (x ^ (x >> 2)) & 0x1111111111111111LL;
   x = (t | (t << 2)) ^ x;
   return x;
}


unsigned rev(unsigned x) {
   static unsigned char table[256] = {0x00, 0x80, 0x40,
      0xC0, 0x20, 0xA0, 0x60, 0xE0, ..., 0xBF, 0x7F, 0xFF};
   int i;
   unsigned r;

   r = 0;
   for (i = 3; i >= 0; i--) {
      r = (r << 8) + table[x & 0xFF];
      x = x >> 8;
   }
   return r;
}


unsigned long long rev(unsigned long long x) {
   unsigned long long t;

   x = (x << 32) | (x >> 32);             // Swap register halves.
   x = (x & 0x0001FFFF0001FFFFLL) << 15 | // Rotate left
       (x & 0xFFFE0000FFFE0000LL) >> 17;  // 15.
   t = (x ^ (x >> 10)) & 0x003F801F003F801FLL;
   x = (t | (t << 10)) ^ x;
   t = (x ^ (x >> 4)) & 0x0E0384210E038421LL;
   x = (t | (t << 4)) ^ x;
   t = (x ^ (x >> 2)) & 0x2248884222488842LL;
   x = (t | (t << 2)) ^ x;
   return x;
}


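if (k &  1) x = (x & 0x55555555) <<  1 | (x & 0xAAAAAAAA) >>  1;
if (k &  2) x = (x & 0x33333333) <<  2 | (x & 0xCCCCCCCC) >>  2;
if (k &  4) x = (x & 0x0F0F0F0F) <<  4 | (x & 0xF0F0F0F0) >>  4;
if (k &  8) x = (x & 0x00FF00FF) <<  8 | (x & 0xFF00FF00) >>  8;
if (k & 16) x = (x & 0x0000FFFF) << 16 | (x & 0xFFFF0000) >> 16;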


0000 0000 0000 0000 0000 0000 abcd efgh ==> 
0000 0000 0000 0000 0000 0000 0000 bdfh 


0, 8, 4, C, 2, A, 6, E, 1, 9, 5, D, 3, B, 7, F.


   unsigned x, m;

   m = 0x80000000;
   x = x ^ m;
   if ((int)x >= 0) {
      do {
         m = m >> 1;
         x = x ^ m;
      } while (x < m);
   }


abcd efgh ijkl mnop ABCD EFGH IJKL MNOP, 


aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP,


AaBb CcDd EeFf GgHh IiJj KkLl MmNn OoPp.


abcd efgh ijkl mnop ABCD EFGH IJKL MNOP
abcd efgh ABCD EFGH ijkl mnop IJKL MNOP
abcd ABCD efgh EFGH ijkl IJKL mnop MNOP
abAB cdCD efEF ghGH ijIJ klKL mnMN opOP
aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP


x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;


t = (x ^ (x >> 8)) & 0x0000FF00;  x = x ^ t ^ (t << 8);
t = (x ^ (x >> 4)) & 0x00F000F0;  x = x ^ t ^ (t << 4);
t = (x ^ (x >> 2)) & 0x0C0C0C0C;  x = x ^ t ^ (t << 2);
t = (x ^ (x >> 1)) & 0x22222222;  x = x ^ t ^ (t << 1);


t = (x ^ (x >> 1)) & 0x22222222;  x = x ^ t ^ (t << 1);
t = (x ^ (x >> 2)) & 0x0C0C0C0C;  x = x ^ t ^ (t << 2);
t = (x ^ (x >> 4)) & 0x00F000F0;  x = x ^ t ^ (t << 4);
t = (x ^ (x >> 8)) & 0x0000FF00;  x = x ^ t ^ (t << 8);


x = (x >> 16) | (x << 16);


0000 0000 0000 0000 ABCD EFGH IJKL MNOP 


0A0B 0C0D 0E0F 0G0H 0I0J 0K0L 0M0N 0O0P.


x = ((x & 0xFF00) << 8) | (x & 0x00FF);
x = ((x << 4) | x) & 0x0F0F0F0F;
x = ((x << 2) | x) & 0x33333333;
x = ((x << 1) | x) & 0x55555555;


x = x & 0x55555555;                 // (If required.)
x = ((x >> 1) | x) & 0x33333333;
x = ((x >> 2) | x) & 0x0F0F0F0F;
x = ((x >> 4) | x) & 0x00FF00FF;
x = ((x >> 8) | x) & 0x0000FFFF;


01234567 89abcdef ghijklmn opqrstuv wxyzABCD EFGHIJKL MNOPQRST UVWXYZ$. 
08gowEMU 19hpxFNV 2aiqyGOW 3bjrzHPX 4cksAIQY 5dltBJRZ 6emuCKS$ 7fnvDLT.


x = x & 0x8040201008040201LL |
   (x & 0x0080402010080402LL) <<  7 |
   (x & 0x0000804020100804LL) << 14 |
   (x & 0x0000008040201008LL) << 21 |
   (x & 0x0000000080402010LL) << 28 |
   (x & 0x0000000000804020LL) << 35 |
   (x & 0x0000000000008040LL) << 42 |
   (x & 0x0000000000000080LL) << 49 |
   (x >>  7) & 0x0080402010080402LL |
   (x >> 14) & 0x0000804020100804LL |
   (x >> 21) & 0x0000008040201008LL |
   (x >> 28) & 0x0000000080402010LL |
   (x >> 35) & 0x0000000000804020LL |
   (x >> 42) & 0x0000000000008040LL |
   (x >> 49) & 0x0000000000000080LL;


t = (x ^ (x >> 7)) & 0x0080402010080402LL;
x = x ^ t ^ (t << 7);


t = (x ^ (x >> 7)) & 0x00AA00AA00AA00AALL;
x = x ^ t ^ (t << 7);


void transpose8(unsigned char A[8], int m, int n,
                unsigned char B[8]) {
   unsigned long long x;
   int i;

   for (i = 0; i <= 7; i++)     // Load 8 bytes from the
      x = x << 8 | A[m*i];      // input array and pack
                                // them into x.

   x = x & 0xAA55AA55AA55AA55LL |
      (x & 0x00AA00AA00AA00AALL) << 7 |
      (x >> 7) & 0x00AA00AA00AA00AALL;
   x = x & 0xCCCC3333CCCC3333LL |
      (x & 0x0000CCCC0000CCCCLL) << 14 |
      (x >> 14) & 0x0000CCCC0000CCCCLL;
   x = x & 0xF0F0F0F00F0F0F0FLL |
      (x & 0x00000000F0F0F0F0LL) << 28 |
      (x >> 28) & 0x00000000F0F0F0F0LL;

   for (i = 7; i >= 0; i--) {   // Store result into
      B[n*i] = x; x = x >> 8;}  // output array B.
}


void transpose32(unsigned A[32]) {
   int j, k;
   unsigned m, t;

   m = 0x0000FFFF;
   for (j = 16; j != 0; j = j >> 1, m = m ^ (m << j)) {
      for (k = 0; k < 32; k = (k + j + 1) & ~j) {
         t = (A[k] ^ (A[k+j] >> j)) & m;
         A[k] = A[k] ^ t;
         A[k+j] = A[k+j] ^ (t << j);
      }
   }
}


#define swap(a0, a1, j, m) t = (a0 ^ (a1 >> j)) & m; \
                           a0 = a0 ^ t; \
                           a1 = a1 ^ (t << j);


void transpose32(unsigned A[32], unsigned B[32]) {
   unsigned m, t;
   unsigned a0, a1, a2, a3, a4, a5, a6, a7,
            a8, a9, a10, a11, a12, a13, a14, a15,
            a16, a17, a18, a19, a20, a21, a22, a23,
            a24, a25, a26, a27, a28, a29, a30, a31;

   a0  = A[ 0];  a1  = A[ 1];  a2  = A[ 2];  a3  = A[ 3];
   a4  = A[ 4];  a5  = A[ 5];  a6  = A[ 6];  a7  = A[ 7];
   ...
   a28 = A[28];  a29 = A[29];  a30 = A[30];  a31 = A[31];

   m = 0x0000FFFF;
   swap(a0,  a16, 16, m)
   swap(a1,  a17, 16, m)
   ...
   swap(a15, a31, 16, m)
   m = 0x00FF00FF;
   swap(a0,  a8,   8, m)
   swap(a1,  a9,   8, m)
   ...
   swap(a28, a29,  1, m)
   swap(a30, a31,  1, m)

   B[ 0] = a0;   B[ 1] = a1;   B[ 2] = a2;   B[ 3] = a3;
   B[ 4] = a4;   B[ 5] = a5;   B[ 6] = a6;   B[ 7] = a7;
   ...
   B[28] = a28;  B[29] = a29;  B[30] = a30;  B[31] = a31;
}


abcd efgh ijkl mnop qrst uvwx yzAB CDEF, 


0000 1111 0011 0011 1010 1010 0101 0101, 


0000 0000 0000 0000 efgh klop qsuw zBDF,


unsigned compress(unsigned x, unsigned m) {
   unsigned r, s, b;    // Result, shift, mask bit.

   r = 0;
   s = 0;
   do {
      b = m & 1;
      r = r | ((x & b) << s);
      s = s + b;
      x = x >> 1;
      m = m >> 1;
   } while (m != 0);
   return r;
}
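
The bit-at-a-time compress loop above is a useful reference against which faster versions can be compared. The following is a minimal sketch, assuming 32-bit words; the name compress32 is illustrative and not from the text. Each mask bit that is 1 copies the corresponding bit of x to the next free low-order position of the result.

```c
#include <stdint.h>

/* Gather the bits of x selected by m into the low-order end of the result. */
uint32_t compress32(uint32_t x, uint32_t m) {
    uint32_t r = 0;      /* result                              */
    uint32_t s = 0;      /* next free bit position in r         */
    uint32_t b;          /* current mask bit (0 or 1)           */

    do {
        b = m & 1;
        r = r | ((x & b) << s);  /* copy the low bit of x if selected */
        s = s + b;
        x = x >> 1;
        m = m >> 1;
    } while (m != 0);
    return r;
}
```

With the mask 0x0F0F0F0F, for instance, the low nibble of each byte of 0x12345678 is selected and the result is 0x2468.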


x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF,
m = 1000 1000 1110 0000 0000 1111 0101 0101,
    1    1    111
    9    6    333            4444  3 2  1 0


x = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F.


mk = 1110 1110 0011 1111 1110 0001 0101 0100,
mp = 1010 0101 1110 1010 1010 0000 1100 1100.


mv = 1000 0000 1110 0000 0000 0000 0100 0100.


m = (m ^ mv) | (mv >> 1);


t = x & mv;
x = (x ^ t) | (t >> 1);


m = 0100 1000 0111 0000 0000 1111 0011 0011,
x = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF.


mk = 0100 1010 0001 0101 0100 0001 0001 0000.


unsigned compress(unsigned x, unsigned m) {
   unsigned mk, mp, mv, t;
   int i;

   x = x & m;           // Clear irrelevant bits.
   mk = ~m << 1;        // We will count 0's to right.

   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk << 1);             // Parallel suffix.
      mp = mp ^ (mp << 2);
      mp = mp ^ (mp << 4);
      mp = mp ^ (mp << 8);
      mp = mp ^ (mp << 16);
      mv = mp & m;                     // Bits to move.
      m = m ^ mv | (mv >> (1 << i));   // Compress m.
      t = x & mv;
      x = x ^ t | (t >> (1 << i));     // Compress x.
      mk = mk & ~mp;
   }
   return x;
}


x = x & m;
t = x & mv0;  x = x ^ t | (t >> 1);
t = x & mv1;  x = x ^ t | (t >> 2);
t = x & mv2;  x = x ^ t | (t >> 4);
t = x & mv3;  x = x ^ t | (t >> 8);
t = x & mv4;  x = x ^ t | (t >> 16);

i=0, 
After PS, 


After PS, 


m   = 0101 0101 0101 0101 0101 0101 0101 0101,

mv0 = 0100 0100 0100 0100 0100 0100 0100 0100
mv1 = 0011 0000 0011 0000 0011 0000 0011 0000
mv2 = 0000 1111 0000 0000 0000 1111 0000 0000
mv3 = 0000 0000 1111 1111 0000 0000 0000 0000
mv4 = 0000 0000 0000 0000 0000 0000 0000 0000


x = ((x ^ y) & mv) ^ x;


unsigned expand(unsigned x, unsigned m) {
   unsigned m0, mk, mp, mv, t;
   unsigned array[5];
   int i;

   m0 = m;              // Save original mask.
   mk = ~m << 1;        // We will count 0's to right.

   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk <<  1);            // Parallel suffix.
      mp = mp ^ (mp <<  2);
      mp = mp ^ (mp <<  4);
      mp = mp ^ (mp <<  8);
      mp = mp ^ (mp << 16);
      mv = mp & m;                     // Bits to move.
      array[i] = mv;
      m = (m ^ mv) | (mv >> (1 << i)); // Compress m.
      mk = mk & ~mp;
   }

   for (i = 4; i >= 0; i--) {
      mv = array[i];
      t = x << (1 << i);
      x = (x & ~mv) | (t & mv);
   }
   return x & m0;       // Clear out extraneous bits.
}


           x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF
           m = 0111 1110 0110 1100 1010 1111 0011 0010

Bit pairs, x = 0bcd ef0g 0j0k mn00 0q0s uvwx 00AB 000E
           m = 0100 0001 0101 0010 0101 0000 1000 1001

Nibbles,   x = 0bcd 0efg 00jk 00mn 00qs uvwx 00AB 000E
           m = 0001 0001 0010 0010 0010 0000 0010 0011

Bytes,     x = 00bc defg 0000 jkmn 00qs uvwx 0000 0ABE
           m = 0000 0010 0000 0100 0000 0010 0000 0101

Halfwords, x = 0000 00bc defg jkmn 0000 000q suvw xABE
           m = 0000 0000 0000 0110 0000 0000 0000 0111

Words,     x = 0000 0000 0000 0bcd efgj kmnq suvw xABE
           m = 0000 0000 0000 0000 0000 0000 0000 1101


Input x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF
Mask  m = 0111 1110 0110 1100 1010 1111 0011 0010


0nop qrs0 0tu0 vw00 x0y0 zABC 00DE 00F0.


m0 = 1000 0001 1001 0011 0101 0000 1100 1101
m1 = 0100 0001 0101 0010 0101 0000 1000 1001
m2 = 0001 0001 0010 0010 0010 0000 0010 0011
m3 = 0000 0010 0000 0100 0000 0010 0000 0101
m4 = 0000 0000 0000 0110 0000 0000 0000 0111


x = hijk lmno pqrs tuvw qrst uvwx yzAB CDEF.


x = lmno pqrs pqrs tuvw vwxy zABC yzAB CDEF.


x = mnop pqrs rstu tuvw vwxy zABC BCDE CDEF.


x = mnop qrrs sttu vwvw wxxy zABC BCDE DEEF.


x = mnop qrss stuu vwww xxyy zABC CCDE EEFF.


x = 0nop qrs0 0tu0 vw00 x0y0 zABC 00DE 00F0.


00100 
00101 
11111 
00000 
00001 
00010 
00011 


p[0] = 1010 1010 1010 1010 1010 1010 1010 1010
p[1] = 1100 1100 1100 1100 1100 1100 1100 1100
p[2] = 0000 1111 0000 1111 0000 1111 0000 1111
p[3] = 0000 1111 1111 0000 0000 1111 1111 0000
p[4] = 0000 1111 1111 1111 1111 0000 0000 0000


SAG(x, m) = compress_left(x, m) | compress(x, ~m).


x    = SAG(x,    p[0]);
p[1] = SAG(p[1], p[0]);
p[2] = SAG(p[2], p[0]);
p[3] = SAG(p[3], p[0]);
p[4] = SAG(p[4], p[0]);

x    = SAG(x,    p[1]);
p[2] = SAG(p[2], p[1]);
p[3] = SAG(p[3], p[1]);
p[4] = SAG(p[4], p[1]);

x    = SAG(x,    p[2]);
p[3] = SAG(p[3], p[2]);
p[4] = SAG(p[4], p[2]);

x    = SAG(x,    p[3]);
p[4] = SAG(p[4], p[3]);

x    = SAG(x,    p[4]);


p[1] = SAG(p[1], p[0]);
p[2] = SAG(SAG(p[2], p[0]), p[1]);
p[3] = SAG(SAG(SAG(p[3], p[0]), p[1]), p[2]);
p[4] = SAG(SAG(SAG(SAG(p[4], p[0]), p[1]), p[2]), p[3]);


x = SAG(x, p[0]);
x = SAG(x, p[1]);
x = SAG(x, p[2]);
x = SAG(x, p[3]);
x = SAG(x, p[4]);


bitgather Rt,Rx,Ri, 


void mulmns(unsigned short w[], unsigned short u[],
   unsigned short v[], int m, int n) {
   unsigned int k, t, b;
   int i, j;

   for (i = 0; i < m; i++)
      w[i] = 0;

   for (j = 0; j < n; j++) {
      k = 0;
      for (i = 0; i < m; i++) {
         t = u[i]*v[j] + w[i + j] + k;
         w[i + j] = t;          // (I.e., t & 0xFFFF).
         k = t >> 16;
      }
      w[j + m] = k;
   }

   // Now w[] has the unsigned product.  Correct by
   // subtracting v*2**16m if u < 0, and
   // subtracting u*2**16n if v < 0.

   if ((short)u[m - 1] < 0) {
      b = 0;                    // Initialize borrow.
      for (j = 0; j < n; j++) {
         t = w[j + m] - v[j] - b;
         w[j + m] = t;
         b = t >> 31;
      }
   }
   if ((short)v[n - 1] < 0) {
      b = 0;
      for (i = 0; i < m; i++) {
         t = w[i + n] - u[i] - b;
         w[i + n] = t;
         b = t >> 31;
      }
   }
   return;
}


int mulhs(int u, int v) {
   unsigned u0, v0, w0;
   int u1, v1, w1, w2, t;

   u0 = u & 0xFFFF;  u1 = u >> 16;
   v0 = v & 0xFFFF;  v1 = v >> 16;
   w0 = u0*v0;
   t  = u1*v0 + (w0 >> 16);
   w1 = t & 0xFFFF;
   w2 = t >> 16;
   w1 = u0*v1 + w1;
   return u1*v1 + w2 + (w1 >> 16);
}


                truncating      floor
  7 / 3          2 rem  1      2 rem  1
(-7) / 3        -2 rem -1     -3 rem  2
  7 / (-3)      -2 rem  1     -3 rem -2
(-7) / (-3)      2 rem -1      2 rem -1


if (imin >= 0) gmin = (imin/10)*10;
else           gmin = ((imin - 9)/10)*10;
if (imax >= 0) gmax = ((imax + 9)/10)*10;
else           gmax = (imax/10)*10;
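
The four lines above round a range outward to multiples of 10 using only truncating division. The following is a minimal sketch of the same idea as two helper functions; the names floor10 and ceil10 are illustrative and not from the text, and the code assumes C99 or later, where integer division truncates toward zero. Inputs close to INT_MIN or INT_MAX could overflow in the adjustment step and are not handled here.

```c
/* Largest multiple of 10 that is <= i, using truncating division. */
int floor10(int i) {
    if (i >= 0) return (i / 10) * 10;
    else        return ((i - 9) / 10) * 10;   /* bias negative values downward */
}

/* Smallest multiple of 10 that is >= i, using truncating division. */
int ceil10(int i) {
    if (i >= 0) return ((i + 9) / 10) * 10;   /* bias positive values upward */
    else        return (i / 10) * 10;
}
```

For a data range of -11 to 17, for example, floor10 and ceil10 give graph limits of -20 and 20.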


int divmnu(unsigned short q[], unsigned short r[],
     const unsigned short u[], const unsigned short v[],
     int m, int n) {

   const unsigned b = 65536; // Number base (16 bits).
   unsigned short *un, *vn;  // Normalized form of u, v.
   unsigned qhat;            // Estimated quotient digit.
   unsigned rhat;            // A remainder.
   unsigned p;               // Product of two digits.
   int s, i, j, t, k;

   if (m < n || n <= 0 || v[n-1] == 0)
      return 1;              // Return if invalid param.

   if (n == 1) {                        // Take care of
      k = 0;                            // the case of a
      for (j = m - 1; j >= 0; j--) {    // single-digit
         q[j] = (k*b + u[j])/v[0];      // divisor here.
         k = (k*b + u[j]) - q[j]*v[0];
      }
      if (r != NULL) r[0] = k;
      return 0;
   }

   // Normalize by shifting v left just enough so that
   // its high-order bit is on, and shift u left the
   // same amount.  We may have to append a high-order
   // digit on the dividend; we do that unconditionally.

   s = nlz(v[n-1]) - 16;        // 0 <= s <= 16.
   vn = (unsigned short *)alloca(2*n);
   for (i = n - 1; i > 0; i--)
      vn[i] = (v[i] << s) | (v[i-1] >> 16-s);
   vn[0] = v[0] << s;

   un = (unsigned short *)alloca(2*(m + 1));
   un[m] = u[m-1] >> 16-s;
   for (i = m - 1; i > 0; i--)
      un[i] = (u[i] << s) | (u[i-1] >> 16-s);
   un[0] = u[0] << s;

   for (j = m - n; j >= 0; j--) {       // Main loop.
      // Compute estimate qhat of q[j].
      qhat = (un[j+n]*b + un[j+n-1])/vn[n-1];
      rhat = (un[j+n]*b + un[j+n-1]) - qhat*vn[n-1];
again:
      if (qhat >= b || qhat*vn[n-2] > b*rhat + un[j+n-2])
      { qhat = qhat - 1;
        rhat = rhat + vn[n-1];
        if (rhat < b) goto again;
      }

      // Multiply and subtract.
      k = 0;
      for (i = 0; i < n; i++) {
         p = qhat*vn[i];
         t = un[i+j] - k - (p & 0xFFFF);
         un[i+j] = t;
         k = (p >> 16) - (t >> 16);
      }
      t = un[j+n] - k;
      un[j+n] = t;

      q[j] = qhat;              // Store quotient digit.
      if (t < 0) {              // If we subtracted too
         q[j] = q[j] - 1;       // much, add back.
         k = 0;
         for (i = 0; i < n; i++) {
            t = un[i+j] + vn[i] + k;
            un[i+j] = t;
            k = t >> 16;
         }
         un[j+n] = un[j+n] + k;
      }
   } // End j.
   // If the caller wants the remainder, unnormalize
   // it and pass it back.
   if (r != NULL) {
      for (i = 0; i < n; i++)
         r[i] = (un[i] >> s) | (un[i+1] << 16-s);
   }
   return 0;
}


unsigned divlu(unsigned x, unsigned y, unsigned z) {
   // Divides (x || y) by z.
   int i;
   unsigned t;

   for (i = 1; i <= 32; i++) {
      t = (int)x >> 31;          // All 1's if x(31) = 1.
      x = (x << 1) | (y >> 31);  // Shift x || y left
      y = y << 1;                // one bit.
      if ((x | t) >= z) {
         x = x - z;
         y = y + 1;
      }
   }
   return y;                     // Remainder is x.
}


unsigned divlu(unsigned u1, unsigned u0, unsigned v,
               unsigned *r) {
   const unsigned b = 65536; // Number base (16 bits).
   unsigned un1, un0,        // Norm. dividend LSD's.
            vn1, vn0,        // Norm. divisor digits.
            q1, q0,          // Quotient digits.
            un32, un21, un10,// Dividend digit pairs.
            rhat;            // A remainder.
   int s;                    // Shift amount for norm.

   if (u1 >= v) {            // If overflow, set rem.
      if (r != NULL)         // to an impossible value,
         *r = 0xFFFFFFFF;    // and return the largest
      return 0xFFFFFFFF;}    // possible quotient.

   s = nlz(v);               // 0 <= s <= 31.
   v = v << s;               // Normalize divisor.
   vn1 = v >> 16;            // Break divisor up into
   vn0 = v & 0xFFFF;         // two 16-bit digits.

   un32 = (u1 << s) | (u0 >> 32 - s) & (-s >> 31);
   un10 = u0 << s;           // Shift dividend left.

   un1 = un10 >> 16;         // Break right half of
   un0 = un10 & 0xFFFF;      // dividend into two digits.

   q1 = un32/vn1;            // Compute the first
   rhat = un32 - q1*vn1;     // quotient digit, q1.
again1:
   if (q1 >= b || q1*vn0 > b*rhat + un1) {
     q1 = q1 - 1;
     rhat = rhat + vn1;
     if (rhat < b) goto again1;}

   un21 = un32*b + un1 - q1*v;  // Multiply and subtract.

   q0 = un21/vn1;            // Compute the second
   rhat = un21 - q0*vn1;     // quotient digit, q0.
again2:
   if (q0 >= b || q0*vn0 > b*rhat + un0) {
     q0 = q0 - 1;
     rhat = rhat + vn1;
     if (rhat < b) goto again2;}

   if (r != NULL)            // If remainder is wanted,
      *r = (un21*b + un0 - q0*v) >> s;     // return it.
   return q1*b + q0;
}


int divls(int u1, unsigned u0, int v, int *r) {
   int q, uneg, vneg, diff, borrow;

   uneg = u1 >> 31;          // -1 if u < 0.
   if (uneg) {               // Compute the absolute
      u0 = -u0;              // value of the dividend u.
      borrow = (u0 != 0);
      u1 = -u1 - borrow;}

   vneg = v >> 31;           // -1 if v < 0.
   v = (v ^ vneg) - vneg;    // Absolute value of v.

   if ((unsigned)u1 >= (unsigned)v) goto overflow;

   q = divlu(u1, u0, v, (unsigned *)r);

   diff = uneg ^ vneg;       // Negate q if signs of
   q = (q ^ diff) - diff;    // u and v differed.
   if (uneg && r != NULL)
      *r = -*r;

   if ((diff ^ q) < 0 && q != 0) {  // If overflow,
overflow:                    // set remainder
      if (r != NULL)         // to an impossible value,
         *r = 0x80000000;    // and return the largest
      q = 0x80000000;}       // possible neg. quotient.
   return q;
}




unsigned long long udivdi3(unsigned long long u,
                           unsigned long long v) {

   unsigned long long u0, u1, v1, q0, q1, k, n;

   if (v >> 32 == 0) {          // If v < 2**32:
      if (u >> 32 < v)          // If u/v cannot overflow,
         return DIVU(u, v)      // just do one division.
            & 0xFFFFFFFF;
      else {                    // If u/v would overflow:
         u1 = u >> 32;          // Break u up into two
         u0 = u & 0xFFFFFFFF;   // halves.
         q1 = DIVU(u1, v)       // First quotient digit.
            & 0xFFFFFFFF;
         k = u1 - q1*v;         // First remainder, < v.
         q0 = DIVU((k << 32) + u0, v) // 2nd quot. digit.
            & 0xFFFFFFFF;
         return (q1 << 32) + q0;
      }
   }
                                // Here v >= 2**32.
   n = nlz64(v);                // 0 <= n <= 31.
   v1 = (v << n) >> 32;         // Normalize the divisor
                                // so its MSB is 1.
   u1 = u >> 1;                 // To ensure no overflow.
   q1 = DIVU(u1, v1)            // Get quotient from
       & 0xFFFFFFFF;            // divide unsigned insn.
   q0 = (q1 << n) >> 31;        // Undo normalization and
                                // division of u by 2.
   if (q0 != 0)                 // Make q0 correct or
      q0 = q0 - 1;              // too small by 1.
   if ((u - q0*v) >= v)
      q0 = q0 + 1;              // Now q0 is correct.
   return q0;
}


if (u < q0*v) q0 = q0 - 1; 


if ((u - q0*v) >= v) q0 = q0 - 1; 


   if (q0 != 0)                 // Make q0 correct or
      q0 = q0 - 1;              // too small by 1.


   if (q0 == 0) return 0;


   if (u < v) return 0;         // Avoid a problem later.


   q0 = q0 - (q0 != 0);


#define llabs(x) \ 
   ({unsigned long long t = (x) >> 63; ((x) ^ t) - t;})


long long divdi3(long long u, long long v) {

   unsigned long long au, av;
   long long q, t;

   au = llabs(u);
   av = llabs(v);
   if (av >> 31 == 0) {         // If |v| < 2**31 and
      if (au < av << 31) {      // |u|/|v| cannot
         q = DIVS(u, v);        // overflow, use DIVS.
         return (q << 32) >> 32;
      }
   }
   q = au/av;                   // Invoke udivdi3.
   t = (u ^ v) >> 63;           // If u, v have different
   return (q ^ t) - t;          // signs, negate q.
}


if ((v << 32) >> 32 == v) { // If v is in range and 


shrsi t,n,k-1 Form the integer 

shri t,t,32-k 2**k - 1 if n < 0, else 0. 
add t,n,t Add it to n, 

shrsi q,t,k and shift right (signed). 


bge n,label Branch if n >= 0. 
addi n,n,2**k-1 Add 2**k - 1 to n, 
label shrsi n,n,k and shift right (signed). 


shrsi q,n,k 
addze q,q 


shli r,q,k 
sub r,n,r 


shrsi t,n,k-1 Form the integer 

shri t,t,32-k 2**k - 1 if n < 0, else 0. 
add t,n,t Add it to n,

andi t,t,-2**k clear rightmost k bits, 
sub r,n,t and subtract it from n. 


li    M,0x55555556   Load magic number, (2**32+2)/3.
mulhs q,M,n          q = floor(M*n/2**32).
shri  t,n,31         Add 1 to q if
add   q,q,t          n is negative.

muli  t,q,3          Compute remainder from
sub   r,n,t          r = n - q*3.


li    M,0x66666667   Load magic number, (2**33+3)/5.
mulhs q,M,n          q = floor(M*n/2**32).
shrsi q,q,1
shri  t,n,31         Add 1 to q if
add   q,q,t          n is negative.

muli  t,q,5          Compute remainder from
sub   r,n,t          r = n - q*5.


li    M,0x92492493   Magic num, (2**34+5)/7 - 2**32.
mulhs q,M,n          q = floor(M*n/2**32).
add   q,q,n          q = floor(M*n/2**32) + n.
shrsi q,q,2          q = floor(q/4).
shri  t,n,31         Add 1 to q if
add   q,q,t          n is negative.

muli  t,q,7          Compute remainder from
sub   r,n,t          r = n - q*7.


li    M,0x6DB6DB6D   Magic num, -(2**34+5)/7 + 2**32.
mulhs q,M,n          q = floor(M*n/2**32).
sub   q,q,n          q = floor(M*n/2**32) - n.
shrsi q,q,2          q = floor(q/4).
shri  t,q,31         Add 1 to q if
add   q,q,t          q is negative (n is positive).

muli  t,q,-7         Compute remainder from
sub   r,n,t          r = n - q*(-7).


q = 2*q;
r = 2*r;
if (r >= abs(d)) {
   q = q + 1;
   r = r - abs(d);}


struct ms {int M;          // Magic number
           int s;};        // and shift amount.

struct ms magic(int d) {   // Must have 2 <= d <= 2**31-1
                           // or   -2**31 <= d <= -2.
   int p;
   unsigned ad, anc, delta, q1, r1, q2, r2, t;
   const unsigned two31 = 0x80000000;     // 2**31.
   struct ms mag;

   ad = abs(d);
   t = two31 + ((unsigned)d >> 31);
   anc = t - 1 - t%ad;     // Absolute value of nc.
   p = 31;                 // Init. p.
   q1 = two31/anc;         // Init. q1 = 2**p/|nc|.
   r1 = two31 - q1*anc;    // Init. r1 = rem(2**p, |nc|).
   q2 = two31/ad;          // Init. q2 = 2**p/|d|.
   r2 = two31 - q2*ad;     // Init. r2 = rem(2**p, |d|).
   do {
      p = p + 1;
      q1 = 2*q1;           // Update q1 = 2**p/|nc|.
      r1 = 2*r1;           // Update r1 = rem(2**p, |nc|).
      if (r1 >= anc) {     // (Must be an unsigned
         q1 = q1 + 1;      // comparison here.)
         r1 = r1 - anc;}
      q2 = 2*q2;           // Update q2 = 2**p/|d|.
      r2 = 2*r2;           // Update r2 = rem(2**p, |d|).
      if (r2 >= ad) {      // (Must be an unsigned
         q2 = q2 + 1;      // comparison here.)
         r2 = r2 - ad;}
      delta = ad - r2;
   } while (q1 < delta || (q1 == delta && r1 == 0));

   mag.M = q2 + 1;
   if (d < 0) mag.M = -mag.M;       // Magic number and
   mag.s = p - 32;                  // shift amount to return.
   return mag;
}
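
To see how the magic number and shift amount are meant to be consumed, the following sketch pairs a copy of the magic routine with a hypothetical helper, sdiv_magic, that performs the division. The helper name and the use of a 64-bit product to stand in for a mulhs instruction are assumptions of this sketch, not part of the text. It also assumes 32-bit int, that converting an out-of-range unsigned value to int wraps, and that right shifts of negative values are arithmetic, which is implementation-defined in C but holds on common compilers.

```c
struct ms { int M; int s; };       /* magic number and shift amount */

struct ms magic(int d) {           /* requires 2 <= |d| <= 2**31-1  */
    int p = 31;
    unsigned ad, anc, delta, q1, r1, q2, r2, t;
    const unsigned two31 = 0x80000000u;
    struct ms mag;

    ad = (d < 0) ? 0u - (unsigned)d : (unsigned)d;
    t = two31 + ((unsigned)d >> 31);
    anc = t - 1 - t % ad;          /* absolute value of nc */
    q1 = two31 / anc;  r1 = two31 - q1 * anc;
    q2 = two31 / ad;   r2 = two31 - q2 * ad;
    do {
        p = p + 1;
        q1 = 2 * q1;  r1 = 2 * r1;
        if (r1 >= anc) { q1 = q1 + 1; r1 = r1 - anc; }
        q2 = 2 * q2;  r2 = 2 * r2;
        if (r2 >= ad)  { q2 = q2 + 1; r2 = r2 - ad; }
        delta = ad - r2;
    } while (q1 < delta || (q1 == delta && r1 == 0));

    mag.M = (int)(q2 + 1);
    if (d < 0) mag.M = -mag.M;
    mag.s = p - 32;
    return mag;
}

/* Hypothetical helper: truncating n/d via multiply-high, add/sub fixup, shift. */
int sdiv_magic(int n, int d) {
    struct ms mag = magic(d);
    int q = (int)(((long long)mag.M * n) >> 32);   /* stands in for mulhs */
    if (d > 0 && mag.M < 0) q += n;
    if (d < 0 && mag.M > 0) q -= n;
    q >>= mag.s;
    q += (int)((unsigned)q >> 31);                 /* add 1 if q is negative */
    return q;
}
```

For d = 7 the routine yields M = 0x92492493 with s = 2, matching the constant in the divide-by-7 instruction sequence shown earlier.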


li    M,0xAAAAAAAB   Load magic number, (2**33+1)/3.
mulhu q,M,n          q = floor(M*n/2**32).
shri  q,q,1
muli  t,q,3          Compute remainder from
sub   r,n,t          r = n - q*3.


li    M,0x24924925   Magic num, (2**35+3)/7 - 2**32.
mulhu q,M,n          q = floor(M*n/2**32).
add   q,q,n          Can overflow (sets carry).
shrxi q,q,3          Shift right with carry bit.

muli  t,q,7          Compute remainder from
sub   r,n,t          r = n - q*7.


li    M,0x24924925   Magic num, (2**35+3)/7 - 2**32.
mulhu q,M,n          q = floor(M*n/2**32).
sub   t,n,q          t = n - q.
shri  t,t,1          t = (n - q)/2.
add   t,t,q          t = (n - q)/2 + q = (n + q)/2.
shri  q,t,2          q = (n+Mn/2**32)/8 = floor(n/7).
muli  t,q,7          Compute remainder from
sub   r,n,t          r = n - q*7.


struct mu {unsigned M;     // Magic number,
          int a;          // "add" indicator,
          int s;};        // and shift amount.

struct mu magicu(unsigned d) {
                           // Must have 1 <= d <= 2**32-1.
   int p;
   unsigned nc, delta, q1, r1, q2, r2;
   struct mu magu;

   magu.a = 0;             // Initialize "add" indicator.
   nc = -1 - (-d)%d;       // Unsigned arithmetic here.
   p = 31;                 // Init. p.
   q1 = 0x80000000/nc;     // Init. q1 = 2**p/nc.
   r1 = 0x80000000 - q1*nc;// Init. r1 = rem(2**p, nc).
   q2 = 0x7FFFFFFF/d;      // Init. q2 = (2**p - 1)/d.
   r2 = 0x7FFFFFFF - q2*d; // Init. r2 = rem(2**p - 1, d).
   do {
      p = p + 1;
      if (r1 >= nc - r1) {
         q1 = 2*q1 + 1;            // Update q1.
         r1 = 2*r1 - nc;}          // Update r1.
      else {
         q1 = 2*q1;
         r1 = 2*r1;}
      if (r2 + 1 >= d - r2) {
         if (q2 >= 0x7FFFFFFF) magu.a = 1;
         q2 = 2*q2 + 1;            // Update q2.
         r2 = 2*r2 + 1 - d;}       // Update r2.
      else {
         if (q2 >= 0x80000000) magu.a = 1;
         q2 = 2*q2;
         r2 = 2*r2 + 1;}
      delta = d - 1 - r2;
   } while (p < 64 &&
           (q1 < delta || (q1 == delta && r1 == 0)));

   magu.M = q2 + 1;        // Magic number
   magu.s = p - 32;        // and shift amount to return
   return magu;            // (magu.a was set above).
}
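
How the three returned values are meant to be used may be easier to see in code. The sketch below assumes a 64-bit type in place of the multiply-high-unsigned instruction, and it assumes that s is at least 1 whenever the "add" indicator is set (as it is for d = 7). The name divu_magic is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: unsigned division of n by a constant, given the
   magic number M, "add" indicator a, and shift amount s for that constant. */
unsigned divu_magic(unsigned n, unsigned M, int a, int s) {
   unsigned q = (unsigned)(((uint64_t)M * n) >> 32);  /* mulhu. */
   if (a) {
      /* The true multiplier is 2**32 + M, so n must be added to q.
         Computing (n - q)/2 + q = (n + q)/2 avoids the overflow. */
      unsigned t = ((n - q) >> 1) + q;
      return t >> (s - 1);
   }
   return q >> s;
}
```

With the constants from the listings in this section, division by 3 uses M = 0xAAAAAAAB, a = 0, s = 1, and division by 7 uses M = 0x24924925, a = 1, s = 3.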


     M31 = 0                M31 = 1

mulhs q,M,n            mulhs q,M,n
shrsi t,n,31           shrsi t,n,31
and   t,t,M            and   t,t,M
add   q,q,t            add   t,t,n
                       add   q,q,t


struct mu {unsigned M;     // Magic number,
          int a;          // "add" indicator,
          int s;};        // and shift amount.

struct mu magicu2(unsigned d) {
                           // Must have 1 <= d <= 2**32-1.
   int p;
   unsigned p32, q, r, delta;
   struct mu magu;
   magu.a = 0;             // Initialize "add" indicator.
   p = 31;                 // Initialize p.
   q = 0x7FFFFFFF/d;       // Initialize q = (2**p - 1)/d.
   r = 0x7FFFFFFF - q*d;   // Init. r = rem(2**p - 1, d).
   do {
      p = p + 1;
      if (p == 32) p32 = 1;     // Set p32 = 2**(p-32).
      else p32 = 2*p32;
      if (r + 1 >= d - r) {
         if (q >= 0x7FFFFFFF) magu.a = 1;
         q = 2*q + 1;           // Update q.
         r = 2*r + 1 - d;       // Update r.
      }
      else {
         if (q >= 0x80000000) magu.a = 1;
         q = 2*q;
         r = 2*r + 1;
      }
      delta = d - 1 - r;
   } while (p < 64 && p32 < delta);
   magu.M = q + 1;         // Magic number and
   magu.s = p - 32;        // shift amount to return
   return magu;            // (magu.a was set above).
}


abs   an,n
li    M,0x92492493  Magic number, (2**34+5)/7.
mulhu q,M,an        q = floor(M*an/2**32).
shri  q,q,2
shrsi t,n,31        These three instructions
xor   q,q,t         negate q if n is
sub   q,q,t         negative.


import sys
from math import log

def magicgu(nmax, d):
   # nc is the largest n <= nmax such that n % d == d - 1.
   nc = ((nmax + 1)//d)*d - 1
   nbits = int(log(nmax, 2)) + 1
   for p in range(0, 2*nbits + 1):
      if 2**p > nc*(d - 1 - (2**p - 1)%d):
         m = (2**p + d - 1 - (2**p - 1)%d)//d
         return (m, p)
   print("Can't find p, something is wrong.")
   sys.exit(1)


li    M,0xB6DB6DB7  Mult. inverse, (5*2**32 + 1)/7.
mul   q,M,n         q = n/7.


7(-1) 
7( 1) 
7(-36) 
7( 37) 
7(-73) 


256( 1) 
256( 0) 
256( 1) 
256(-1) 
256( 2) 


unsigned mulinv(unsigned d) {           // d must be odd.
   unsigned x1, v1, x2, v2, x3, v3, q;

   x1 = 0xFFFFFFFF;     v1 = -d;
   x2 = 1;              v2 = d;
   while (v2 > 1) {
      q = v1/v2;
      x3 = x1 - q*x2;   v3 = v1 - q*v2;
      x1 = x2;          v1 = v2;
      x2 = x3;          v2 = v3;
   }
   return x2;
}


unsigned mulinv(unsigned d) {           // d must be odd.
   unsigned xn, t;

   xn = d;
loop: t = d*xn;
   if (t == 1) return xn;
   xn = xn*(2 - t);
   goto loop;
}


li     M,0xC28F5C29  Load mult. inverse of 25.
mul    q,M,n         q = right half of M*n.
li     c,0x0A3D70A3  c = floor((2**32-1)/25).
cmpleu t,q,c         Compare q and c, and branch
bt     t,is_mult     if n is a multiple of 25.
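
The same test can be written in C. This is a sketch only, assuming 32-bit unsigned arithmetic that wraps modulo 2**32; the function name is_mult25 is hypothetical, while both constants are taken from the listing above.

```c
/* Returns 1 if n is a multiple of 25, else 0.  Multiplying by the
   multiplicative inverse of 25 (mod 2**32) maps the multiples of 25
   onto 0, 1, ..., floor((2**32 - 1)/25); every other n lands above. */
int is_mult25(unsigned n) {
   unsigned q = n * 0xC28F5C29u;     /* Right half of M*n. */
   return q <= 0x0A3D70A3u;          /* c = floor((2**32 - 1)/25). */
}
```

No division or remainder instruction is executed, only one multiplication and one unsigned comparison.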


li     M,0xC28F5C29  Load mult. inverse of 25.
mul    q,M,n         q = right half of M*n.
shrri  q,q,2         Rotate right two positions.
li     c,0x028F5C28  c = floor((2**32-1)/100).
cmpleu t,q,c         Compare q and c, and branch
bt     t,is_mult     if n is a multiple of 100.


li     M,0xC28F5C29  Load mult. inverse of 25.
mul    q,M,n         q = right half of M*n.
li     c,0x051EB850  c = floor((2**31 - 1)/25) & -4.
add    q,q,c         Add c.
shrri  q,q,2         Rotate right two positions.
shri   c,c,1         Compute const. for comparison.
cmpleu t,q,c         Compare q and c, and
bt     t,is_mult     branch if n is a mult. of 100.


unsigned divu3(unsigned n) {
   unsigned n0, n1, w0, w1, w2, t, q;

   n0 = n & 0xFFFF;
   n1 = n >> 16;
   w0 = n0*0xAAAB;
   t = n1*0xAAAB + (w0 >> 16);
   w1 = t & 0xFFFF;
   w2 = t >> 16;
   w1 = n0*0xAAAA + w1;
   q = n1*0xAAAA + w2 + (w1 >> 16);
   return q >> 1;
}


Ч = (n >> 1) + (n >> 3); 


а >> 


unsigned divu3(unsigned n) {
   unsigned q, r;

   q = (n >> 2) + (n >> 4);        // q = n*0.0101 (approx).
   q = q + (q >> 4);               // q = n*0.01010101.
   q = q + (q >> 8);
   q = q + (q >> 16);
   r = n - q*3;                    // 0 <= r <= 15.
   return q + (11*r >> 5);         // Returning q + r/3.
// return q + (5*(r + 1) >> 4);         // Alternative 1.
// return q + ((r + 5 + (r << 2)) >> 4);// Alternative 2.
}


unsigned divu100(unsigned n) {
   unsigned q, r;

   q = (n >> 1) + (n >> 3) + (n >> 6) - (n >> 10) +
       (n >> 12) + (n >> 13) - (n >> 16);
   q = q + (q >> 20);
   q = q >> 6;
   r = n - q*100;
   return q + ((r + 28) >> 7);
// return q + (r > 99);
}


unsigned divu1000(unsigned n) {
   unsigned q, r, t;

   t = (n >> 7) + (n >> 8) + (n >> 12);
   q = (n >> 1) + t + (n >> 15) + (t >> 11) + (t >> 14);
   q = q >> 9;
   r = n - q*1000;
   return q + ((r + 24) >> 10);
// return q + (r > 999);
}


n = n + (n>>31 & 5); 


int divs3(int n) {
   int q, r;

   n = n + (n>>31 & 2);            // Add 2 if n < 0.
   q = (n >> 2) + (n >> 4);        // q = n*0.0101 (approx).
   q = q + (q >> 4);               // q = n*0.01010101.
   q = q + (q >> 8);
   q = q + (q >> 16);
   r = n - q*3;                    // 0 <= r <= 14.
   return q + (11*r >> 5);         // Returning q + r/3.
// return q + (5*(r + 1) >> 4);         // Alternative 1.
// return q + ((r + 5 + (r << 2)) >> 4);// Alternative 2.
}


int divs1000(int n) {
   int q, r, t;

   n = n + (n>>31 & 999);
   t = (n >> 7) + (n >> 8) + (n >> 12);
   q = (n >> 1) + t + (n >> 15) + (t >> 11) + (t >> 14) +
       (n >> 26) + (t >> 21);
   q = q >> 9;
   r = n - q*1000;
   return q + ((r + 24) >> 10);
// return q + (r > 999);
}


n = pop(n & 0x55555555) - pop(n & 0xAAAAAAAA); 


int remu3(unsigned n) {
   n = pop(n ^ 0xAAAAAAAA) + 23;        // Now 23 <= n <= 55.
   n = pop(n ^ 0x2A) - 3;               // Now -3 <= n <= 2.
   return n + (((int)n >> 31) & 3);
}
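
The listing relies on a population count function pop that is not shown in this section. A minimal self-contained sketch follows; the pop body is the usual divide-and-conquer bit count and is an assumption here, not necessarily the version the author has in mind, and the name remu3_pop is hypothetical.

```c
/* Assumed helper: number of 1-bits in a 32-bit word. */
int pop(unsigned x) {
   x = x - ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   x = (x + (x >> 4)) & 0x0F0F0F0F;
   x = x + (x >> 8);
   x = x + (x >> 16);
   return x & 0x3F;
}

/* Remainder of unsigned division by 3 using two population counts. */
int remu3_pop(unsigned n) {
   n = pop(n ^ 0xAAAAAAAA) + 23;        /* Now 23 <= n <= 55. */
   n = pop(n ^ 0x2A) - 3;               /* Now -3 <= n <= 2. */
   return n + (((int)n >> 31) & 3);     /* Add 3 if negative. */
}
```

The idea is that each bit in an even position contributes +1 and each bit in an odd position contributes -1 to the value modulo 3, so the difference of two bit counts carries the remainder.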


int remu3(unsigned n) {

   static char table[33] = {2, 0,1,2, 0,1,2, 0,1,2,
        0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
        0,1,2, 0,1};

   n = pop(n ^ 0xAAAAAAAA);
   return table[n];
}


n = (n >> 16) + (n & 0xFFFF); 


int remu3(unsigned n) {
   n = (n >> 16) + (n & 0xFFFF);        // Max 0x1FFFE.
   n = (n >>  8) + (n & 0x00FF);        // Max 0x2FD.
   n = (n >>  4) + (n & 0x000F);        // Max 0x3D.
   n = (n >>  2) + (n & 0x0003);        // Max 0x11.
   n = (n >>  2) + (n & 0x0003);        // Max 0x6.
   return (0x0924 >> (n << 1)) & 3;
}


int remu3(unsigned n) {
   static char table[62] = {0,1,2, 0,1,2, 0,1,2, 0,1,2,
        0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
        0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
        0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1};

   n = (n >> 16) + (n & 0xFFFF);        // Max 0x1FFFE.
   n = (n >>  8) + (n & 0x00FF);        // Max 0x2FD.
   n = (n >>  4) + (n & 0x000F);        // Max 0x3D.
   return table[n];
}


int remu5(unsigned n) {
   n = (n >> 16) + (n & 0xFFFF);        // Max 0x1FFFE.
   n = (n >>  8) + (n & 0x00FF);        // Max 0x2FD.
   n = (n >>  4) + (n & 0x000F);        // Max 0x3D.
   n = (n>>4) - ((n>>2) & 3) + (n & 3); // -3 to 6.
   return (01043210432 >> 3*(n + 3)) & 7; // Octal const.
}


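int remu5(unsigned n) {
   static char table[62] = {0,1,2,3,4, 0,1,2,3,4,
        0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
        0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
        0,1,2,3,4, 0,1,2,3,4, 0,1};

   n = (n >> 16) + (n & 0xFFFF);        // Max 0x1FFFE.
   n = (n >>  8) + (n & 0x00FF);        // Max 0x2FD.
   n = (n >>  4) + (n & 0x000F);        // Max 0x3D.
   return table[n];
}


int remu7(unsigned n) {
   static char table[75] = {0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
        0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
        0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
        0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4};

   n = (n >> 15) + (n & 0x7FFF);        // Max 0x27FFE.
   n = (n >>  9) + (n & 0x001FF);       // Max 0x33D.
   n = (n >>  6) + (n & 0x0003F);       // Max 0x4A.
   return table[n];
}


int remu9(unsigned n) {
   int r;
   static char table[75] = {0,1,2,3,4,5,6,7,8,
        0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
        0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
        0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
        0,1,2,3,4,5,6,7,8, 0,1,2};

   r = (n & 0x7FFF) - (n >> 15);        // FFFE0001 to 7FFF.
   r = (r & 0x01FF) - (r >> 9);         // FFFFFFC1 to 2FF.
   r = (r & 0x003F) + (r >> 6);         // 0 to 4A.
   return table[r];
}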


int rems3(int n) {
   unsigned r;
   static char table[62] = {0,1,2, 0,1,2, 0,1,2, 0,1,2,
        0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
        0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
        0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1};

   r = n;
   r = (r >> 16) + (r & 0xFFFF);        // Max 0x1FFFE.
   r = (r >>  8) + (r & 0x00FF);        // Max 0x2FD.
   r = (r >>  4) + (r & 0x000F);        // Max 0x3D.
   r = table[r];
   return r - (((unsigned)n >> 31) << (r & 2));
}


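int rems5(int n) {
   int r;
   // Entry i holds (i - 8) mod 5, for i - 8 = -8 to 53.
   static char table[62] = {2,3,4, 0,1,2,3,4,
        0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
        0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
        0,1,2,3,4, 0,1,2,3,4, 0,1,2,3};

   r = (n >> 16) + (n & 0xFFFF);        // FFFF8000 to 17FFE.
   r = (r >>  8) + (r & 0x00FF);        // FFFFFF80 to 27D.
   r = (r >>  4) + (r & 0x000F);        // -8 to 53 (decimal).
   r = table[r + 8];
   return r - (((int)(n & -r) >> 31) & 5);
}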


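int rems9(int n) {
   int r;
   // Entry i holds (i - 2) mod 9, for i - 2 = -2 to 72.
   static char table[75] = {7,8, 0,1,2,3,4,5,6,7,8,
        0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
        0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
        0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
        0,1,2,3,4,5,6,7,8, 0};

   r = (n & 0x7FFF) - (n >> 15);        // FFFF7001 to 17FFF.
   r = (r & 0x01FF) - (r >> 9);         // FFFFFF41 to 0x27F.
   r = (r & 0x003F) + (r >> 6);         // -2 to 72 (decimal).
   r = table[r + 2];
   return r - (((int)(n & -r) >> 31) & 9);
}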




int remu3(unsigned n) {
   return (0x55555555*n + (n >> 1) - (n >> 3)) >> 30;
}


int remu3(unsigned n) {
   unsigned r;

   r = n + (n << 2);
   r = r + (r << 4);
   r = r + (r << 8);
   r = r + (r << 16);
   r = r + (n >> 1);
   r = r - (n >> 3);
   return r >> 30;
}


int remu5(unsigned n) {
   n = (0x33333333*n + (n >> 3)) >> 29;
   return (0x04432210 >> (n << 2)) & 7;
}


int remu7(unsigned n) {
   n = (0x24924924*n + (n >> 1) + (n >> 4)) >> 29;
   return n & ((int)(n - 7) >> 31);
}


int remu9(unsigned n) {
   static char table[16] = {0, 1, 1, 2, 2, 3, 3, 4,
                            5, 5, 6, 6, 7, 7, 8, 8};

   n = (0x1C71C71C*n + (n >> 1)) >> 28;
   return table[n];
}


int remu10(unsigned n) {
   static char table[16] = {0, 1, 2, 2, 3, 3, 4, 5,
                            5, 6, 7, 7, 8, 8, 9, 0};

   n = (0x19999999*n + (n >> 1) + (n >> 3)) >> 28;
   return table[n];
}


r = (n << 2) + (n << 8);             // r = 0x104*n.
r = r + (r << 12);                   // r = 0x104104*n.
r = r + (n << 26);                   // r = 0x04104104*n.


int remu63(unsigned n) {
   unsigned t;

   t = (((n >> 12) + n) >> 10) + (n << 2);
   t = ((t >> 6) + t + 3) & 0xFF;
   return (t - (t >> 6)) >> 2;
}


int remu63(unsigned n) {
   n = (0x04104104*n + (n >> 4) + (n >> 10)) >> 26;
   return n & ((n - 63) >> 6);          // Change 63 to 0.
}


int rems3(int n) {
   unsigned r;

   r = n;
   r = (0x55555555*r + (r >> 1) - (r >> 3)) >> 30;
   return r - (((unsigned)n >> 31) << (r & 2));
}


int rems5(int n) {
   unsigned r;
   // u marks a table entry that is never referenced.
   static signed char table[16] = {0, 1, 2, 2, 3, u, 4, 0,
                                   u, 0,-4, u,-3,-2,-2,-1};

   r = n;
   r = ((0x33333333*r) + (r >> 3)) >> 29;
   return table[r + (((unsigned)n >> 31) << 3)];
}


int rems7(int n) {
   unsigned r;

   r = n - (((unsigned)n >> 31) << 2);  // Fix for sign.
   r = ((0x24924924*r) + (r >> 1) + (r >> 4)) >> 29;
   r = r & ((int)(r - 7) >> 31);        // Change 7 to 0.
   return r - (((int)(n&-r) >> 31) & 7);// Fix n<0 case.
}


int rems9(int n) {
   unsigned r;
   // u marks a table entry that is never referenced.
   static signed char table[32] = {0, 1, 1, 2, u, 3, u, 4,
                                   5, 5, 6, 6, 7, u, 8, u,
                                  -4, u,-3, u,-2,-1,-1, 0,
                                   u,-8, u,-7,-6,-6,-5,-5};

   r = n;
   r = (0x1C71C71C*r + (r >> 1)) >> 28;
   return table[r + (((unsigned)n >> 31) << 4)];
}


int rems10(int n) {
   unsigned r;
   // u marks a table entry that is never referenced.
   static signed char table[32] = {0, 1, u, 2, 3, u, 4, 5,
                                   5, 6, u, 7, 8, u, 9, u,
                                  -6,-5, u,-4,-3,-3,-2, u,
                                  -1, 0, u,-9, u,-8,-7, u};

   r = n;
   r = (0x19999999*r + (r >> 1) + (r >> 3)) >> 28;
   return table[r + (((unsigned)n >> 31) << 4)];
}


unsigned divu3(unsigned n) {
   unsigned r;

   r = (0x55555555*n + (n >> 1) - (n >> 3)) >> 30;
   return (n - r)*0xAAAAAAAB;
}


int divs3(int n) {
   unsigned r;

   r = n;
   r = (0x55555555*r + (r >> 1) - (r >> 3)) >> 30;
   r = r - (((unsigned)n >> 31) << (r & 2));
   return (n - r)*0xAAAAAAAB;
}


unsigned divu10(unsigned n) {
   unsigned r;
   static char table[16] = {0, 1, 2, 2, 3, 3, 4, 5,
                            5, 6, 7, 7, 8, 8, 9, 0};

   r = (0x19999999*n + (n >> 1) + (n >> 3)) >> 28;
   r = table[r];
   return ((n - r) >> 1)*0xCCCCCCCD;
}


int isqrt(unsigned x) {
   unsigned x1;
   int s, g0, g1;

   if (x <= 1) return x;
   s = 1;
   x1 = x - 1;
   if (x1 > 65535) {s = s + 8; x1 = x1 >> 16;}
   if (x1 > 255)   {s = s + 4; x1 = x1 >> 8;}
   if (x1 > 15)    {s = s + 2; x1 = x1 >> 4;}
   if (x1 > 3)     {s = s + 1;}

   g0 = 1 << s;                // g0 = 2**s.
   g1 = (g0 + (x >> s)) >> 1;  // g1 = (g0 + x/g0)/2.

   while (g1 < g0) {           // Do while approximations
      g0 = g1;                 // strictly decrease.
      g1 = (g0 + (x/g0)) >> 1;
   }
   return g0;
}


if (x <= 1) return x; 
s = 16 - nlz(x - 1)/2; 
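
The two lines above depend on nlz, the number of leading zeros function, which is not defined in this section. As a minimal sketch, assuming a plain binary-search implementation of nlz (any correct implementation would do), the shift amount for the first guess can be computed as follows. The wrapper name isqrt_shift is hypothetical.

```c
/* Assumed helper: number of leading zero bits in a 32-bit word. */
int nlz(unsigned x) {
   int n = 0;
   if (x == 0) return 32;
   if (x <= 0x0000FFFF) {n = n + 16; x = x << 16;}
   if (x <= 0x00FFFFFF) {n = n + 8;  x = x << 8;}
   if (x <= 0x0FFFFFFF) {n = n + 4;  x = x << 4;}
   if (x <= 0x3FFFFFFF) {n = n + 2;  x = x << 2;}
   if (x <= 0x7FFFFFFF) {n = n + 1;}
   return n;
}

/* For x > 1, returns s such that 2**s is the least power of 2
   that is greater than or equal to the square root of x. */
int isqrt_shift(unsigned x) {
   return 16 - nlz(x - 1)/2;
}
```

On a machine with a leading-zeros instruction this replaces the chain of four comparisons used to set s in the Newton's-method routine.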


int isqrt(unsigned x) {
   int s, g0, g1;

   if (x <= 4224)
      if (x <= 24)
         if (x <= 3) return (x + 3) >> 2;
         else if (x <= 8) return 2;
         else return (x >> 4) + 3;
      else if (x <= 288)
         if (x <= 80) s = 3; else s = 4;
      else if (x <= 1088) s = 5; else s = 6;
   else if (x <= 1025*1025 - 1)
      if (x <= 257*257 - 1)
         if (x <= 129*129 - 1) s = 7; else s = 8;
      else if (x <= 513*513 - 1) s = 9; else s = 10;
   else if (x <= 4097*4097 - 1)
      if (x <= 2049*2049 - 1) s = 11; else s = 12;
   else if (x <= 16385*16385 - 1)
      if (x <= 8193*8193 - 1) s = 13; else s = 14;
   else if (x <= 32769*32769 - 1) s = 15; else s = 16;
   g0 = 1 << s;                // g0 = 2**s.

   // Continue as in Figure 11-1.


int isqrt(unsigned x) {
   unsigned a, b, m;           // Limits and midpoint.

   a = 1;
   b = (x >> 5) + 8;           // See text.
   if (b > 65535) b = 65535;
   do {
      m = (a + b) >> 1;
      if (m*m > x) b = m - 1;
      else         a = m + 1;
   } while (b >= a);
   return a - 1;
}


b = (1 << (33 - nlz(x))/2) - 1;
a = (b + 3)/2;


0011 


0011 


0011 


01 


0011 
1001 


1010 


x0 
bl 


xl 
b2 


x2 
b3 


x3 
b4 


x4 


Initially, x = 179 (0xB3). 


0100 0000 
0010 0000 


0011 0000 
0001 1000 


0001 1000 
0000 1100 


0000 1101 


(Can't subtract). 


int isqrt(unsigned x) {
   unsigned m, y, b;

   m = 0x40000000;
   y = 0;
   while(m != 0) {              // Do 16 times.
      b = y | m;
      y = y >> 1;
      if (x >= b) {
         x = x - b;
         y = y | m;
      }
      m = m >> 2;
   }
   return y;
}


t = (int)(x | ~(x - b)) >> 31;  // -1 if x >= b, else 0.
x = x - (b & t);
y = y | (m & t);


int icbrt(unsigned x) {
   int s;
   unsigned y, b;

   y = 0;
   for (s = 30; s >= 0; s = s - 3) {
      y = 2*y;
      b = (3*y*(y + 1) + 1) << s;
      if (x >= b) {
         x = x - b;
         y = y + 1;
      }
   }
   return y;
}


int iexp(int x, unsigned n) {
   int p, y;

   y = 1;                       // Initialize result
   p = x;                       // and p.
   while(1) {
      if (n & 1) y = p*y;       // If n is odd, mult by p.
      n = n >> 1;               // Position next bit of n.
      if (n == 0) return y;     // If no more bits in n.
      p = p*p;                  // Power for next bit of n.
   }
}


int ilog10(unsigned x) {
   int i;
   static unsigned table[11] = {0, 9, 99, 999, 9999,
      99999, 999999, 9999999, 99999999, 999999999,
      0xFFFFFFFF};

   for (i = -1; ; i++) {
      if (x <= table[i+1]) return i;
   }
}


p=1; 
for (i = -1; i <= 8; i++) { 
if (x < p) return i; 


p = 10*p; 
} 


return i; 


return 3 - ((x - 1000) >> 31); 
return 2 + ((999 - x) >> 31); 
return 2 + ((x + 2147482648) >> 31); 


return 8 + ((x + 1147483648) | x) >> 31; 


return ((int)(x - 1) >> 31) | ((unsigned)(9 - x) >> 31); 
return (x > 9) + (x > 0) - 1; 


int ilog10(unsigned x) {
   if (x > 99)
      if (x < 1000000)
         if (x < 10000)
            return 3 + ((int)(x - 1000) >> 31);
         else
            return 5 + ((int)(x - 100000) >> 31);
      else
         if (x < 100000000)
            return 7 + ((int)(x - 10000000) >> 31);
         else
            return 9 + ((int)((x-1000000000)&~x) >> 31);
   else
      if (x > 9) return 1;
      else       return ((int)(x - 1) >> 31);
}


int ilog10(unsigned x) {
   int y;
   static unsigned char table1[33] = {9, 9, 9, 8, 8, 8,
      7, 7, 7, 6, 6, 6, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3,
      2, 2, 2, 1, 1, 1, 0, 0, 0, 0};
   static unsigned table2[10] = {1, 10, 100, 1000, 10000,
      100000, 1000000, 10000000, 100000000, 1000000000};

   y = table1[nlz(x)];
   if (x < table2[y]) y = y - 1;
   return y;
}


int ilog10(unsigned x) {
   int y;
   static unsigned char table1[33] = {10, 9, 9, 8, 8, 8,
      7, 7, 7, 6, 6, 6, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3,
      2, 2, 2, 1, 1, 1, 0, 0, 0, 0};
   static unsigned table2[11] = {1, 10, 100, 1000, 10000,
      100000, 1000000, 10000000, 100000000, 1000000000,
      0};

   y = table1[nlz(x)];
   y = y - ((x - table2[y]) >> 31);
   return y;
}


y = y + (((table2[y+1] - x) | x) >> 31); 


   static unsigned table2[10] = {0, 9, 99, 999, 9999,
      99999, 999999, 9999999, 99999999, 999999999};

   y = (9*(31 - nlz(x))) >> 5;
   if (x > table2[y+1]) y = y + 1;
   return y;


int ilog10(unsigned x) {
   int y;
   static unsigned table2[11] = {0, 9, 99, 999, 9999,
      99999, 999999, 9999999, 99999999, 999999999,
      0xFFFFFFFF};

   y = (19*(31 - nlz(x))) >> 6;
   y = y + ((table2[y+1] - x) >> 31);
   return y;
}


   unsigned table2[20] = {0, 9, 99, 999, 9999, ...,
      9999999999999999999};

   y = (19*(63 - nlz(x))) >> 6;
   y = y + ((table2[y+1] - x) >> 63);
   return y;


Addition 

11.1.11 Jh 
44 ο 1 -l.1 19 
11 0 1 ο 1 *(-11) 


Subtraction 
11 1 1 
l 0 Ll 0-1 21 


00011010    6
00011101    13
11100101    6 complemented
11110110    (6 complemented) + 13
00001001    Complement of the sum (-7)
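
The base -2 bit patterns in this example can be checked mechanically. The sketch below uses a well-known add-and-exclusive-or identity for converting between two's-complement and base -2; it is offered as an assumption for checking purposes, not as a restatement of the surrounding text, and the function names are hypothetical. It applies only to values representable in 32 base -2 digits.

```c
/* Convert a two's-complement integer to its base -2 bit pattern.
   0xAAAAAAAA has 1-bits in the positions of negative weight. */
unsigned to_base_m2(int n) {
   return ((unsigned)n + 0xAAAAAAAAu) ^ 0xAAAAAAAAu;
}

/* Convert a base -2 bit pattern back to a two's-complement integer. */
int from_base_m2(unsigned b) {
   return (int)((b ^ 0xAAAAAAAAu) - 0xAAAAAAAAu);
}
```

For instance, 6 maps to binary 11010 and 13 maps to 11101, matching the first two rows of the example, and the pattern 00001001 converts back to -7.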


int divbm2(int n, int d) {     // q = n/d in base -2.
   int r, dw, c, q, i;

   r = n;                      // Init. remainder.
   dw = (-128)*d;              // Position d.
   c = (-43)*d;                // Init. comparand.
   if (d > 0) c = c + d;
   q = 0;                      // Init. quotient.
   for (i = 7; i >= 0; i--) {
      if (d > 0 ^ (i&1) == 0 ^ r >= c) {
         q = q | (1 << i);     // Set a quotient bit.
         r = r - dw;           // Subtract d shifted.
      }
      dw = dw/(-2);            // Position d.
      if (d > 0) c = c - 2*d;  // Set comparand for
      else c = c + d;          // next iteration.
      c = c/(-2);
   }
   return q;                   // Return quotient in
                               // base -2.
                               // Remainder is r,
                               // 0 <= r < |d|.
}


B = G ^ (G >> 1);
B = B ^ (B >> 2);
B = B ^ (B >> 4);
B = B ^ (B >> 8);
B = B ^ (B >> 16);

for (i = 0; i < n; i++) {
   G = i ^ (i >> 1);
   output G;
}


B = index(G);          B = index(G);
B = B + 1;             M = ~B & (B + 1);
Gp = B ^ (B >> 1);     Gp = G ^ M;
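
The conversions that these fragments rely on can be written as two small C functions. This is a sketch for 32-bit words; the names bin_to_gray and gray_to_bin are hypothetical, and gray_to_bin plays the role that index(G) plays in the fragments above.

```c
/* Binary integer to reflected binary Gray code. */
unsigned bin_to_gray(unsigned b) {
   return b ^ (b >> 1);
}

/* Gray code to binary integer: each result bit is the exclusive or
   of the Gray bit in that position and all Gray bits to its left,
   computed here as a parallel prefix in five steps. */
unsigned gray_to_bin(unsigned g) {
   g = g ^ (g >> 1);
   g = g ^ (g >> 2);
   g = g ^ (g >> 4);
   g = g ^ (g >> 8);
   g = g ^ (g >> 16);
   return g;
}
```

Incrementing a Gray-coded value then amounts to gray_to_bin, add 1, bin_to_gray, which is what the left-hand column above spells out.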


unsigned int crc32(unsigned char *message) {
   int i, j;
   unsigned int byte, crc;

   i = 0;
   crc = 0xFFFFFFFF;
   while (message[i] != 0) {
      byte = message[i];            // Get next byte.
      byte = reverse(byte);         // 32-bit reversal.
      for (j = 0; j <= 7; j++) {    // Do eight times.
         if ((int)(crc ^ byte) < 0)
              crc = (crc << 1) ^ 0x04C11DB7;
         else crc = crc << 1;
         byte = byte << 1;          // Ready next msg bit.
      }
      i = i + 1;
   }
   return reverse(~crc);
}


unsigned int crc32(unsigned char *message) {
   int i, j;
   unsigned int byte, crc, mask;

   i = 0;
   crc = 0xFFFFFFFF;
   while (message[i] != 0) {
      byte = message[i];            // Get next byte.
      crc = crc ^ byte;
      for (j = 7; j >= 0; j--) {    // Do eight times.
         mask = -(crc & 1);
         crc = (crc >> 1) ^ (0xEDB88320 & mask);
      }
      i = i + 1;
   }
   return ~crc;
}
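
A quick way to gain confidence in a CRC-32 routine of this kind is to run it on the conventional test string "123456789", for which the standard CRC-32 check value is 0xCBF43926. The sketch below restates the bit-at-a-time algorithm in self-contained form so that it can be compiled on its own; the name crc32_bitwise is hypothetical.

```c
/* Bit-at-a-time CRC-32 of a null-terminated message, using the
   reversed polynomial 0xEDB88320 and no table. */
unsigned int crc32_bitwise(const unsigned char *message) {
   unsigned int crc = 0xFFFFFFFF;
   int i, j;

   for (i = 0; message[i] != 0; i++) {
      crc = crc ^ message[i];
      for (j = 0; j < 8; j++) {              /* Do eight times. */
         unsigned int mask = -(crc & 1);     /* All 1's if low bit set. */
         crc = (crc >> 1) ^ (0xEDB88320 & mask);
      }
   }
   return ~crc;
}
```

Because the message is treated as a C string, a message containing a zero byte cannot be processed this way; a length parameter would be needed for binary data.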


unsigned int crc32(unsigned char *message) {
   int i, j;
   unsigned int byte, crc, mask;
   static unsigned int table[256];

   /* Set up the table, if necessary. */

   if (table[1] == 0) {
      for (byte = 0; byte <= 255; byte++) {
         crc = byte;
         for (j = 7; j >= 0; j--) {    // Do eight times.
            mask = -(crc & 1);
            crc = (crc >> 1) ^ (0xEDB88320 & mask);
         }
         table[byte] = crc;
      }
   }

   /* Through with table setup, now calculate the CRC. */

   i = 0;
   crc = 0xFFFFFFFF;
   while ((byte = message[i]) != 0) {
      crc = (crc >> 8) ^ table[(crc ^ byte) & 0xFF];
      i = i + 1;
   }
   return ~crc;
}


unsigned int checkbits(unsigned int u) {

   /* Computes the six parity check bits for the
   "information" bits given in the 32-bit word u. The
   check bits are p[5:0]. On sending, an overall parity
   bit will be prepended to p (by another process).

   Bit   Checks these bits of u
   p[0]  0, 1, 3, 5, ..., 31 (0 and the odd positions).
   p[1]  0, 2-3, 6-7, ..., 30-31 (0 and positions xxx1x).
   p[2]  0, 4-7, 12-15, 20-23, 28-31 (0 and posns xx1xx).
   p[3]  0, 8-15, 24-31 (0 and positions x1xxx).
   p[4]  0, 16-31 (0 and positions 1xxxx).
   p[5]  1-31 */

   unsigned int p0, p1, p2, p3, p4, p5, p6, p;
   unsigned int t1, t2, t3;

   // First calculate p[5:0] ignoring u[0].
   p0 = u ^ (u >> 2);
   p0 = p0 ^ (p0 >> 4);
   p0 = p0 ^ (p0 >> 8);
   p0 = p0 ^ (p0 >> 16);        // p0 is in posn 1.

   t1 = u ^ (u >> 1);
   p1 = t1 ^ (t1 >> 4);
   p1 = p1 ^ (p1 >> 8);
   p1 = p1 ^ (p1 >> 16);        // p1 is in posn 2.

   t2 = t1 ^ (t1 >> 2);
   p2 = t2 ^ (t2 >> 8);
   p2 = p2 ^ (p2 >> 16);        // p2 is in posn 4.

   t3 = t2 ^ (t2 >> 4);
   p3 = t3 ^ (t3 >> 16);        // p3 is in posn 8.

   p4 = t3 ^ (t3 >> 8);         // p4 is in posn 16.

   p5 = p4 ^ (p4 >> 16);        // p5 is in posn 0.

   p = ((p0>>1) & 1) | ((p1>>1) & 2) | ((p2>>2) & 4) |
       ((p3>>5) & 8) | ((p4>>12) & 16) | ((p5 & 1) << 5);

   p = p ^ (-(u & 1) & 0x3F);   // Now account for u[0].
   return p;
}


int correct(unsigned int pr, unsigned int *ur) {

   /* This function looks at the received seven check
   bits and 32 information bits (pr and ur) and
   determines how many errors occurred (under the
   presumption that it must be 0, 1, or 2). It returns
   with 0, 1, or 2, meaning that no errors, one error, or
   two errors occurred. It corrects the information word
   received (ur) if there was one error in it. */

   unsigned int po, p, syn, b;

   po = parity(pr ^ *ur);       // Compute overall parity
                                // of the received data.
   p = checkbits(*ur);          // Calculate check bits
                                // for the received info.
   syn = p ^ (pr & 0x3F);       // Syndrome (exclusive of
                                // overall parity bit).
   if (po == 0) {
      if (syn == 0) return 0;   // If no errors, return 0.
      else return 2;            // Two errors, return 2.
   }
                                // One error occurred.
   if (((syn - 1) & syn) == 0)  // If syn has zero or one
      return 1;                 // bits set, then the
                                // error is in the check
                                // bits or the overall
                                // parity bit (no
                                // correction required).

   // One error, and syn bits 5:0 tell where it is in ur.

   b = syn - 31 - (syn >> 5);   // Map syn to range 0 to 31.
// if (syn == 0x1f) b = 0;      // (These two lines equiv.
// else b = syn & 0x1f;         // to the one line above.)
   *ur = *ur ^ (1 << b);        // Correct the bit.
   return 1;
}
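
The error-correction routine calls a function parity that is not shown here. A minimal sketch of such a function follows, assuming it should return 1 when its 32-bit argument has an odd number of 1-bits and 0 otherwise; this shift-and-exclusive-or version is one common way to compute it and is not necessarily the one the author uses.

```c
/* Assumed helper: returns 1 if x has an odd number of 1-bits, else 0.
   Each step folds the word in half with exclusive or, so the low-order
   bit ends up holding the parity of all 32 bits. */
unsigned int parity(unsigned int x) {
   x = x ^ (x >> 16);
   x = x ^ (x >> 8);
   x = x ^ (x >> 4);
   x = x ^ (x >> 2);
   x = x ^ (x >> 1);
   return x & 1;
}
```

With this definition, po == 0 in the routine above means that the received check bits and information bits together have even overall parity.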


p0 = pop(u & 0xAAAAAAAB) & 1;
p1 = pop(u & 0xCCCCCCCD) & 1;


void step(int);
void hilbert(int dir, int rot, int order) {

   if (order == 0) return;

   dir = dir + rot;
   hilbert(dir, -rot, order - 1);
   step(dir);
   dir = dir - rot;
   hilbert(dir, rot, order - 1);
   step(dir);
   hilbert(dir, rot, order - 1);
   dir = dir - rot;
   step(dir);
   hilbert(dir, -rot, order - 1);
}


0000   00   00
0001   01   00
0010   01   01
0011   00   01
0100   00   10
0101   00   11
0110   01   11
0111   01   10
1000   10   10
1001   10   11
1010   11   11
1011   11   10
1100   11   01
1101   10   01
1110   10   00
1111   11   00


#include <stdio.h>
#include <stdlib.h>

int x = -1, y = 0;              // Global variables.
int s = 0;                      // Dist. along curve.
int blen;                       // Length to print.

void hilbert(int dir, int rot, int order);

void binary(unsigned k, int len, char *s) {
/* Converts the unsigned integer k to binary character
form. Result is string s of length len. */

   int i;

   s[len] = 0;
   for (i = len - 1; i >= 0; i--) {
      if (k & 1) s[i] = '1';
      else       s[i] = '0';
      k = k >> 1;
   }
}

void step(int dir) {
   char ii[33], xx[17], yy[17];

   switch(dir & 3) {
      case 0: x = x + 1; break;
      case 1: y = y + 1; break;
      case 2: x = x - 1; break;
      case 3: y = y - 1; break;
   }
   binary(s, 2*blen, ii);
   binary(x, blen, xx);
   binary(y, blen, yy);
   printf("%5d %s %s %s\n", dir, ii, xx, yy);
   s = s + 1;                   // Increment distance.
}

int main(int argc, char *argv[]) {
   int order;

   order = atoi(argv[1]);
   blen = order;
   step(0);                     // Print init. point.
   hilbert(0, 1, order);
   return 0;
}


void hil_xy_from_s(unsigned s, int n, unsigned *xp,
                                      unsigned *yp) {
   int i;
   unsigned state, x, y, row;

   state = 0;                            // Initialize.
   x = y = 0;

   for (i = 2*n - 2; i >= 0; i -= 2) {   // Do n times.
      row = 4*state | (s >> i) & 3;      // Row in table.
      x = (x << 1) | (0x936C >> row) & 1;
      y = (y << 1) | (0x39C6 >> row) & 1;
      state = (0x3E6B94C1 >> 2*row) & 3; // New state.
   }
   *xp = x;                              // Pass back
   *yp = y;                              // results.
}


void hil_xy_from_s(unsigned s, int n, unsigned *xp,
                                      unsigned *yp) {
   int i, sa, sb;
   unsigned x, y, temp;

   for (i = 0; i < 2*n; i += 2) {
      sa = (s >> (i+1)) & 1;      // Get bit i+1 of s.
      sb = (s >> i) & 1;          // Get bit i of s.

      if ((sa ^ sb) == 0) {       // If sa,sb = 00 or 11,
         temp = x;                // swap x and y,
         x = y^(-sa);             // and if sa = 1,
         y = temp^(-sa);          // complement them.
      }
      x = (x >> 1) | (sa << 31);  // Prepend sa to x and
      y = (y >> 1) | ((sa ^ sb) << 31); // (sa^sb) to y.
   }
   *xp = x >> (32 - n);           // Right-adjust x and y
   *yp = y >> (32 - n);           // and return them to
}                                 // the caller.


swap = (sa ^ sb) - 1;  // -1 if should swap, else 0.
cmpl = -(sa & sb);     // -1 if should compl't, else 0.
x = x ^ y;
y = y ^ (x & swap) ^ cmpl;
x = x ^ y;


void hil_xy_from_s(unsigned s, int n, unsigned *xp,
                                      unsigned *yp) {
   unsigned comp, swap, cs, t, sr;

   s = s | (0x55555555 << 2*n); // Pad s on left with 01
   sr = (s >> 1) & 0x55555555;  // (no change) groups.
   cs = ((s & 0x55555555) + sr) // Compute complement &
        ^ 0x55555555;           // swap info in two-bit
                                // groups.
   // Parallel prefix xor op to propagate both complement
   // and swap info together from left to right (there is
   // no step "cs ^= cs >> 1", so in effect it computes
   // two independent parallel prefix operations on two
   // interleaved sets of sixteen bits).

   cs = cs ^ (cs >> 2);
   cs = cs ^ (cs >> 4);
   cs = cs ^ (cs >> 8);
   cs = cs ^ (cs >> 16);
   swap = cs & 0x55555555;      // Separate the swap and
   comp = (cs >> 1) & 0x55555555;  // complement bits.

   t = (s & swap) ^ comp;       // Calculate x and y in
   s = s ^ sr ^ t ^ (t << 1);   // the odd & even bit
                                // positions, resp.
   s = s & ((1 << 2*n) - 1);    // Clear out any junk
                                // on the left (unpad).

   // Now "unshuffle" to separate the x and y bits.

   t = (s ^ (s >> 1)) & 0x22222222; s = s ^ t ^ (t << 1);
   t = (s ^ (s >> 2)) & 0x0C0C0C0C; s = s ^ t ^ (t << 2);
   t = (s ^ (s >> 4)) & 0x00F000F0; s = s ^ t ^ (t << 4);
   t = (s ^ (s >> 8)) & 0x0000FF00; s = s ^ t ^ (t << 8);

   *xp = s >> 16;               // Assign the two halves
   *yp = s & 0xFFFF;            // of t to x and y.
}


unsigned hil_s_from_xy(unsigned x, unsigned y, int n) {
   int i;
   unsigned state, s, row;

   state = 0;                            // Initialize.
   s = 0;

   for (i = n - 1; i >= 0; i--) {
      row = 4*state | 2*((x >> i) & 1) | (y >> i) & 1;
      s = (s << 2) | (0x361E9CB4 >> 2*row) & 3;
      state = (0x8FE65831 >> 2*row) & 3;
   }
   return s;
}


unsigned hil_s_from_xy(unsigned x, unsigned y, int n) {
   int i, xi, yi;
   unsigned s, temp;

   s = 0;                         // Initialize.
   for (i = n - 1; i >= 0; i--) {
      xi = (x >> i) & 1;          // Get bit i of x.
      yi = (y >> i) & 1;          // Get bit i of y.

      if (yi == 0) {
         temp = x;                // Swap x and y and,
         x = y^(-xi);             // if xi = 1,
         y = temp^(-xi);          // complement them.
      }
      s = 4*s + 2*xi + (xi^yi);   // Append two bits to s.
   }
   return s;
}


void hil_inc_xy(unsigned *xp, unsigned *yp, int n) {
   int i;
   unsigned x, y, state, dx, dy, row, dochange;

   x = *xp;
   y = *yp;
   state = 0;                   // Initialize.
   dx = -((1 << n) - 1);        // Init. -(2**n - 1).
   dy = 0;

   for (i = n-1; i >= 0; i--) {         // Do n times.
      row = 4*state | 2*((x >> i) & 1) | (y >> i) & 1;
      dochange = (0xBDDB >> row) & 1;
      if (dochange) {
         dx = ((0x16451659 >> 2*row) & 3) - 1;
         dy = ((0x51166516 >> 2*row) & 3) - 1;
      }
      state = (0x8FE65831 >> 2*row) & 3;
   }
   *xp = *xp + dx;
   *yp = *yp + dy;
}


0 10000000 10010010000111111011011 


40490FDB, 


union {int ihalf; float xhalf;};
ihalf = ix - 0x00800000;


float rsqrt(float x0) {
   union {int ix; float x;};

   x = x0;                      // x can be viewed as int.
   float xhalf = 0.5f*x;
   ix = 0x5f375a82 - (ix >> 1); // Initial guess.
   x = x*(1.5f - xhalf*x*x);    // Newton step.
   return x;
}
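
The anonymous union declared inside the function is not accepted by every C compiler. As a sketch under that assumption, the same computation can be expressed with memcpy to reinterpret the bits, which is well defined in standard C provided float and int32_t are both 32 bits wide. The name rsqrt_portable is hypothetical; the constant is the one from the listing above.

```c
#include <stdint.h>
#include <string.h>

/* Approximate reciprocal square root of a positive, normal float. */
float rsqrt_portable(float x0) {
   float x = x0;
   float xhalf = 0.5f*x0;
   int32_t ix;

   memcpy(&ix, &x, sizeof ix);          /* View the float as an int. */
   ix = 0x5f375a82 - (ix >> 1);         /* Initial guess. */
   memcpy(&x, &ix, sizeof x);           /* View the int as a float. */
   x = x*(1.5f - xhalf*x*x);            /* One Newton step. */
   return x;
}
```

For x0 = 4.0 this yields a value close to 0.5; a second Newton step could be appended if more accuracy is wanted.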


union {int ix; float x;};       // Make ix and x overlap.


ix = 0x5f000000 - (ix >> 1);    // Refer to x as integer ix.


for (i = 0; i <= 0xFFFFFFFF; i++) {...}


i = 0xFFFFFFFF;
do {i = i + 1; ...} while (i < 0xFFFFFFFF);


SELCC  RT,RA,RB,RC 


MUX RT,RA,RB,RC 


RT <-- RA & RC | RB & ~RC 


SHLD RT,RA,RB,RC 


  111...1111
+ 000...0100
+          1 (end-around carry)
  000...0100


for (i1 = 0; i1 < 3; i1++) {
   for (i2 = i1 + 1; i2 < 3; i2++) {


/* Determines which of the 256 Boolean functions of
three variables can be implemented with three binary
Boolean instructions if the instruction set includes
all 16 binary Boolean operations. */

#include <stdio.h>
char found[256];

unsigned char boole(int op, unsigned char x,
                            unsigned char y) {
   switch (op) {
      case  0: return 0;
      case  1: return x & y;
      case  2: return x & ~y;
      case  3: return x;
      case  4: return ~x & y;
      case  5: return y;
      case  6: return x ^ y;
      case  7: return x | y;
      case  8: return ~(x | y);
      case  9: return ~(x ^ y);
      case 10: return ~y;
      case 11: return x | ~y;
      case 12: return ~x;
      case 13: return ~x | y;
      case 14: return ~(x & y);
      case 15: return 0xFF;
   }
   return 0;                    // Not reached for 0 <= op <= 15.
}

#define NB 16                   // Number of Boolean operations.
int main() {
   int i, j, o1, i1, i2, o2, j1, j2, o3, k1, k2;
   unsigned char fun[6];        // Truth table, 3 columns for
                                // x, y, and z, and 3 columns
                                // for computed functions.
   fun[0] = 0x0F;               // Truth table column for x,
   fun[1] = 0x33;               // y,
   fun[2] = 0x55;               // and z.

   for (o1 = 0; o1 < NB; o1++) {
   for (i1 = 0; i1 < 3; i1++) {
   for (i2 = 0; i2 < 3; i2++) {
      fun[3] = boole(o1, fun[i1], fun[i2]);
      for (o2 = 0; o2 < NB; o2++) {
      for (j1 = 0; j1 < 4; j1++) {
      for (j2 = 0; j2 < 4; j2++) {
         fun[4] = boole(o2, fun[j1], fun[j2]);
         for (o3 = 0; o3 < NB; o3++) {
         for (k1 = 0; k1 < 5; k1++) {
         for (k2 = 0; k2 < 5; k2++) {
            fun[5] = boole(o3, fun[k1], fun[k2]);
            found[fun[5]] = 1;
         }}}
      }}}
   }}}

   printf("  0 1 2 3 4 5 6 7 8 9 A B C D E F\n");
   for (i = 0; i < 16; i++) {
      printf("%X", i);
      for (j = 0; j < 16; j++)
         printf("%2d", found[16*i + j]);
      printf("\n");
   }
   return 0;
}


r = x % 10;
y = x - r;
if (r > 5 | (r == 5 & (y & 2) != 0))
   y = y + 10;


r = (x + 5)%10;
y = x + 5 - r;
if (r == 0 & (y & 2) != 0)
   y = y - 10;


int loadUnaligned(int *a) {
   int *alo, *ahi;
   int xlo, xhi, shift;

   alo = (int *)((int)a & -4);
   ahi = (int *)(((int)a + 3) & -4);
   xlo = *alo;
   xhi = *ahi;
   shift = ((int)a & 3) << 3;
   return ((unsigned)xlo >> shift) | (xhi << (32-shift));
}


temp = nlz(c & d);
if (temp == 0) return 0xFFFFFFFF;
m = 1 << (32 - temp);
return b | d | (m - 1);


int ntz (unsigned int n) {
   static unsigned char tab[32] =
     { 0,  1,  2, 24,  3, 19,  6, 25,
      22,  4, 20, 10, 16,  7, 12, 26,
      31, 23, 18,  5, 21,  9, 15, 11,
      30, 17,  8, 14, 29, 13, 28, 27
     };
   unsigned int k;
   n = n & (-n);        /* isolate lsb */
#if defined(SLOW_MUL)
   k = (n << 11) - n;
   k = (k <<  2) + k;
   k = (k <<  8) + n;
   k = (k <<  5) - k;
#else
   k = n * 0x4d7651f;
#endif
   return n ? tab[k>>27] : 32;
}


int fmaxstr1(unsigned x, int *apos) {
   int k;
   unsigned oldx;

   oldx = 0;
   for (k = 0; x != 0; k++) {
      oldx = x;
      x &= 2*x;
   }
   *apos = nlz(oldx);
   return k;
}


int bestfit(unsigned x, int n, int *apos) {
   int m, s;

   m = n;
   while (m > 1) {
      s = m >> 1;
      x = x & (x << s);
      m = m - s;
   }
   return fminstr1(x, apos) + n - 1;
}


int fminstr1(unsigned x, int *apos) {
   int k, kmin, y0, y;
   unsigned int x0, xmin;

   kmin = 32;
   y0 = pop(x);
   x0 = x;
   do {
      x = ((x & -x) + x) & x;   // Turn off rightmost
      y = pop(x);               // string.
      k = y0 - y;               // k = length of string
      if (k <= kmin) {          // turned off.
         kmin = k;              // Save shortest length
         xmin = x;              // found, and the string.
      }
      y0 = y;
   } while (x != 0);
   *apos = nlz(x0 ^ xmin);
   return kmin;
}


low = u*v;


low = (w1 << 16) + (w0 & 0xFFFF);


low = ((t1 + t2) << 16) + w0;


unsigned mulhu(unsigned u, unsigned v) {
   unsigned a, b, c, d, p, q, rlow, rhigh;

   a = u >> 16;  b = u & 0xFFFF;
   c = v >> 16;  d = v & 0xFFFF;

   p = a*c;
   q = b*d;
   rlow = (-a + b)*(c - d);
   rhigh = (int)((-a + b)^(c - d)) >> 31;
   if (rlow == 0) rhigh = 0;    // Correction.

   q = q + (q >> 16);   // Overflow cannot occur here.
   rlow = rlow + p;
   if (rlow < p) rhigh = rhigh + 1;
   rlow = rlow + q;
   if (rlow < q) rhigh = rhigh + 1;

   return p + (rlow >> 16) + (rhigh << 16);
}


q + ((p + q + rlow) << 16)


unsigned mulhu(unsigned u, unsigned v) {
   unsigned a, b, c, d, p, q, x, y, rlow, rhigh, t;

   a = u >> 16;  b = u & 0xFFFF;
   c = v >> 16;  d = v & 0xFFFF;

   p = a*c;
   q = b*d;
   x = -a + b;
   y = c - d;
   rlow = x*y;
   rhigh = (x ^ y) & (rlow | -rlow);
   rhigh = (int)rhigh >> 31;

   q = q + (q >> 16);   // Overflow cannot occur here.
   t = (rlow & 0xFFFF) + (p & 0xFFFF) + (q & 0xFFFF);
   p += (t >> 16) + (rlow >> 16) + (p >> 16) + (q >> 16);
   p += (rhigh << 16);
   return p;
}


ins   n,R0,0,1      Clear low-order bit of n.
li    M,0x92492493  Load magic number.
mulhu q,M,n         q = floor(M*n/2**32).
shri  q,q,3         q = q/8.


shri  n,n,1         Halve the dividend.
li    M,0x92492493  Load magic number.
mulhu q,M,n         q = floor(M*n/2**32).
shri  q,q,2         q = q/4.


def magicg(nmax, d):
   nc = (nmax//d)*d - 1
   nbits = int(log(nmax, 2)) + 1
   for p in range(0, 2*nbits - 1):
      if 2**p > nc*(d - (2**p)%d):
         m = (2**p + d - (2**p)%d)//d
         return (m, p)
   print("Can't find p, something is wrong.")
   sys.exit(1)


int icbrt64(unsigned long long x) {
   int s;
   unsigned long long y, b, bs;

   y = 0;
   for (s = 63; s >= 0; s = s - 3) {
      y = 2*y;
      b = 3*y*(y + 1) + 1;
      bs = b << s;
      if (x >= bs && b == (bs >> s)) {
         x = x - bs;
         y = y + 1;
      }
   }
   return y;
}


   y = 0;
   if (x >= 0x1000000000000000LL) {
      if (x >= 0x8000000000000000LL) {
         x = x - 0x8000000000000000LL;
         y = 2;
      } else {
         x = x - 0x1000000000000000LL;
         y = 1;
      }
   }
   for (s = 57; s >= 0; s = s - 3) {


import sys
import cmath

num = sys.argv[1:]
if len(num) == 0:
   print("Converts a base -1 + 1j number, given in decimal")
   print("or hex, to the form a + bj, with a, b real.")
   sys.exit()
num = eval(num[0])
r = 0
weight = 1
while num > 0:
   if num & 1:
      r = r + weight
   weight = (-1 + 1j)*weight
   num = num >> 1
print('r = %s' % r)


000 
001 
011 


0000 
0001 
0011 
0010 
0110 
1110 
1010 
1011 
1001 
1000 


00 
01 
02 
12 
11 
10 
20 
21 
22 
32 
31 
30 


crc = 0xFFFFFFFF;
while (((word = *(unsigned int *)message) & 0xFF) != 0) {
   crc = crc ^ word;
   crc = (crc >> 8) ^ table[crc & 0xFF];
   crc = (crc >> 8) ^ table[crc & 0xFF];
   crc = (crc >> 8) ^ table[crc & 0xFF];
   crc = (crc >> 8) ^ table[crc & 0xFF];
   message = message + 4;


float asqrt(float x0) {
   union {int ix; float x;};

   x = x0;                      // x can be viewed as int.
   ix = 0x1fbb67a8 + (ix >> 1); // Initial guess.
   x = 0.5f*(x + x0/x);         // Newton step.
   return x;
}


i = 0x2a51067f + i/3; // Initial guess. 


float acbrt(float x0) {
   union {int ix; float x;};

   x = x0;                      // x can be viewed as int.
   ix = ix/4 + ix/16;           // Approximate divide by 3.
   ix = ix + ix/16;
   ix = ix + ix/256;
   ix = 0x2a5137a0 + ix;        // Initial guess.
   x = 0.33333333f*(2.0f*x + x0/(x*x));  // Newton step.
   return x;
}


double rsqrtd(double x0) {
   union {long long ix; double x;};

   x = x0;
   ix = 0x5fe6ec85e8000000LL - (ix >> 1);
   return x;
}


